# DATA 1 Practical 2 - Questions

Simos Gerasimou


## Classic Cars & Co

Classic Cars & Co is a UK company that has a large collection of classic cars from the 1980s. 

DataVision (the company you work for as a Data Scientist) has been contracted to analyse the data available for the cars and provide insights into their different characteristics (e.g., speed, price).

This Jupyter Notebook will be presented to the main stakeholders of Classic Cars & Co, who have limited knowledge of data science. Your findings should therefore be accompanied by a suitable justification explaining what you observe and, where applicable, what the observation means and, possibly, why it occurs. The analysis and its explanation will help them decide whether to invest more in expanding their collection.

### **Important Information**

(1) To answer these exercises, you **must first read Chapter 2: Introduction to NumPy from the Python Data Science Handbook** (https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)


(2) Each question (task) comes with a description that is accompanied (most of the time) by two cells: one for writing the Python code and another for providing the justification. Feel free to add more cells if you feel they are needed, but keep the cells belonging to the same question close together.

**Hint 1**: If you have difficulty solving a task, look at Chapter 2 of the Python Data Science Handbook.

**Hint 2**: Solving each task using NumPy should require fewer than 10 lines of code.

#### **T1) Explore the dataset and for each column write its name, data type (categorical/numerical: nominal, ordinal, discrete, continuous) and its meaning (i.e., what does it capture?)**

* You may want to open the CSV file using a text editor (e.g., Notepad) or a spreadsheet editor (e.g., Excel)

**Write your answer here (the first is given)**
* Make: Categorical (Nominal) - The manufacturer (brand) of the car
* ....

### 1) Reading the dataset

The classic cars dataset is available on the VLE (look for classicCars.csv in the Practicals section).

In [20]:
#Using NumPy to read the dataset
import numpy as np
#Define the path to the dataset
data_path = "ClassicCars.csv"
#Define the type of each dataset column.
#This is needed because a plain NumPy array holds a single data type, while the CSV columns mix text and numbers.
#Hence, we are using structured arrays.
#We will soon move to Pandas, which makes data manipulation easier.
#Type codes: 'U20' = text of up to 20 characters, 'f4' = 32-bit float, 'i4' = 32-bit integer
types = ['U20', 'U10', 'U5', 'U20', 'U3', 'f4', 'f4', 'f4', 'f4', 'U10', 'i4', 'i4', 'i4', 'i4', 'i4']
#Read the dataset
data = np.genfromtxt(data_path, dtype=types, delimiter=',', names=True)

**Structured Arrays**
* Read more about structured arrays:
  * https://jakevdp.github.io/PythonDataScienceHandbook/02.09-structured-data-numpy.html
  * https://numpy.org/doc/stable/user/basics.rec.html
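
As a minimal sketch of how a structured array behaves, the snippet below builds a tiny one by hand. The three records and their values are made up for illustration and are not taken from the dataset; only the field names mirror columns of the dataset.

```python
import numpy as np

# Hypothetical toy records (NOT from the dataset): make, horsepower, price
toy = np.array(
    [("audi", 100, 14000), ("bmw", 120, 16000), ("honda", 80, 7000)],
    dtype=[("make", "U20"), ("horsepower", "i4"), ("price", "i4")],
)

print(toy.shape)        # one-dimensional: one entry per record
print(toy["make"])      # a whole column, selected by field name
print(toy[0])           # a whole record, selected by position
print(toy[0]["price"])  # a single value: record first, then field
```

Each entry of the array is a full record, which is why the shape has a single dimension even though the data has several columns.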

### Analysing the dataset


#### **Extracting the column names**

In [38]:
data.dtype.names

('make',
 'fueltype',
 'numofdoors',
 'bodystyle',
 'drivewheels',
 'wheelbase',
 'length',
 'width',
 'height',
 'numofcylinders',
 'enginesize',
 'horsepower',
 'citympg',
 'highwaympg',
 'price')

#### **Extracting the shape of the array**

In [22]:
print("The shape of the array is: ", data.shape)

The shape of the array is:  (205,)


#### **T2) What do you see?**
* How many entries does the array have?
* What does each entry include? 
* Hint: Print the elements of an entry


**Write your answer here**

.......

#### **Extracting the entries of a column given its name**

* By specifying the name of a column, you can get all the entries within the array for this column (reminder: you are using Structured Arrays)


In [39]:
#Print the unique entries within the 'make' column
print(np.unique(data['make']))

['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen'
 'volvo']


#### **T3) Extract the bodystyles within the dataset**


In [45]:
#Write your answer here
np.unique(data['bodystyle']).tolist()

['convertible', 'hardtop', 'hatchback', 'sedan', 'wagon']

### What do the car prices look like?


#### **T4) Calculate the range of car prices for the entire dataset**


In [25]:
#Write your answer here
np.max(data['price'])-np.min(data['price'])

40282

#### **T5) Calculate the min, max, mean and median prices of the cars**


In [26]:
#Write your answer here
print('min: ', np.min(data['price']))
print('max: ', np.max(data['price']))
print('mean: ', np.mean(data['price']))
print('median: ', np.median(data['price']))



min:  5118
max:  45400
mean:  13300.239024390245
median:  10345.0


#### **T6) Considering the values calculated above, what insights can you extract? Where do you think the majority of car prices will be clustered?**


**Write your answer here**

.......

#### **T7) Write code to calculate the standard deviation for the car prices. Then use the corresponding NumPy function to confirm the correctness of your calculation**



In [27]:
#Write your answer here
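#Standard deviation using the NumPy function
print(np.std(data['price']))
#Manual calculation: square root of the mean squared deviation from the mean
prices = data['price']
(np.sum((prices-np.mean(prices))**2)/len(prices))**(1/2)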

7969.54140103854


7969.54140103854
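
The two results agree because `np.std` divides by the number of values by default (the population standard deviation, `ddof=0`). The sketch below, on made-up values chosen so the arithmetic is easy to follow, contrasts this with the sample standard deviation, which divides by one less than the number of values (`ddof=1`).

```python
import numpy as np

# Hypothetical toy values (NOT from the dataset); their mean is 5
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
squared_dev = (values - values.mean()) ** 2

# Population standard deviation: divide by n (NumPy's default, ddof=0)
manual_pop = np.sqrt(squared_dev.sum() / len(values))

# Sample standard deviation: divide by n - 1 (ddof=1)
manual_sample = np.sqrt(squared_dev.sum() / (len(values) - 1))

print(manual_pop, np.std(values))
print(manual_sample, np.std(values, ddof=1))
```

Which of the two is appropriate depends on whether the data is treated as the whole population or as a sample from a larger one.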

#### **T8) Find the details of cars with the smallest and largest car volumes**
* Hint: see how to calculate the volume of a car https://info.japanesecartrade.com/content-item/297-what-is-m3-cubic-meter-size-of-a-vehicle


In [28]:
#Write your answer here
#Volume of each car as length x height x width
volume = data['length']*data['height']*data['width']
#Positions of the cars with the smallest and largest volume
idx_min = np.argmin(volume)
idx_max = np.argmax(volume)
print("Min Vol: ", volume[idx_min], " :", data[idx_min])
print("Max Vol: ", volume[idx_max], " :", data[idx_max])

Min Vol:  452643.2  : ('chevrolet', 'gas', 'two', 'hatchback', 'fwd', 88.4, 141.1, 60.3, 53.2, 'three', 61, 48, 47, 53, 5151)
Max Vol:  846007.7  : ('mercedes-benz', 'gas', 'four', 'sedan', 'rwd', 120.9, 208.1, 71.7, 56.7, 'eight', 308, 184, 14, 16, 40960)
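
The hint for this task points to a page about cubic metres. As a rough sketch, and assuming that the length, width and height columns are recorded in inches (an assumption; the notebook does not state the units), a volume can be converted to cubic metres as follows. The dimensions used are those of the smallest car printed above.

```python
# ASSUMPTION: dimensions are in inches (the notebook does not state the units)
INCH_TO_METRE = 0.0254  # exact, by definition of the international inch

# Length, width and height of the smallest car in the output above
length, width, height = 141.1, 60.3, 53.2

volume_in3 = length * width * height
# Converting a volume means applying the length conversion three times
volume_m3 = volume_in3 * INCH_TO_METRE ** 3

print(f"{volume_in3:.1f} cubic inches = {volume_m3:.2f} cubic metres")
```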


#### **T9) Find the different types of bodystyles for the cars in the dataset**

* Hint: You may want to check: https://numpy.org/doc/stable/reference/generated/numpy.unique.html

In [46]:
#Write your answer here
np.unique(data['bodystyle']).tolist()

['convertible', 'hardtop', 'hatchback', 'sedan', 'wagon']

#### **T10) Find the number of different car *brands* (makes)**


In [30]:
#Write your answer here
len(np.unique(data['make']))

22

#### **T11) Find the engine size and the horsepower for the most and least efficient cars when driven in the city and on the highway (i.e., the cars with the smallest and largest difference in fuel consumption between city and highway driving)**

In [31]:
#Write your answer here
#Difference between highway and city mpg for each car
eff = data['highwaympg'] - data['citympg']
#Positions of the cars with the smallest and largest difference
leff = np.argmin(eff)
heff = np.argmax(eff)
print("Least: ", data['enginesize'][leff], " ", data['horsepower'][leff])
print("Most: ", data['enginesize'][heff], " ", data['horsepower'][heff])

Least:  152   95
Most:  203   288


#### **T12) Find the make with the largest number of cars and how many they are**

In [32]:
#Write your answer here
makes,counts = np.unique(data['make'], return_counts=True)
most = np.argmax(counts)
print(makes[most]," : ", counts[most])

toyota  :  32
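
To see how `np.unique` with `return_counts=True` pairs each value with its count, here is a minimal sketch on a made-up array of labels (not taken from the dataset).

```python
import numpy as np

# Hypothetical toy labels (NOT from the dataset)
labels = np.array(["sedan", "wagon", "sedan", "hatchback", "sedan", "wagon"])

# values is sorted; counts[i] is the number of occurrences of values[i]
values, counts = np.unique(labels, return_counts=True)
for value, count in zip(values, counts):
    print(value, count)

# np.argmax returns the position of the first maximum if there is a tie
top = values[np.argmax(counts)]
print("Most frequent:", top)
```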


#### **T13) Find how many cars have a wheelbase greater than 100**

* Hint: See https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html

In [33]:
#Write your answer here
np.count_nonzero(data['wheelbase']>100)

63

#### **T14) Find if there are any convertible cars that cost less than £15000**

* Hint: See https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html

In [48]:
#Write your answer here
data[(data['bodystyle'] == 'convertible') & (data['price']<15000)]['make'].tolist()

['alfa-romero', 'volkswagen']
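
The question asks whether *any* such car exists, which can also be answered directly with `np.any` on the boolean mask. The sketch below uses a made-up structured array (not the dataset) to show the mask being built, tested and then used for selection.

```python
import numpy as np

# Hypothetical toy records (NOT from the dataset)
toy = np.array(
    [("convertible", 12000), ("sedan", 9000), ("convertible", 30000)],
    dtype=[("bodystyle", "U15"), ("price", "i4")],
)

# Parentheses are required: & binds more tightly than == and <
mask = (toy["bodystyle"] == "convertible") & (toy["price"] < 15000)

print(np.any(mask))            # is there at least one match?
print(np.count_nonzero(mask))  # how many matches?
print(toy[mask])               # the matching records themselves
```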

#### **T15) Calculate the interquartile range for the price of all cars**

In [35]:
#Write your answer here
np.percentile(data['price'], 75) - np.percentile(data['price'], 25)

8715.0
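
`np.percentile` also accepts a list of percentiles, so both quartiles can be obtained in one call. A minimal sketch on made-up values (not the dataset), using NumPy's default linear interpolation:

```python
import numpy as np

# Hypothetical toy values (NOT from the dataset)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First and third quartiles in a single call
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

The interquartile range covers the middle half of the values, so it is less affected by a few very cheap or very expensive cars than the full range is.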

#### **T16) Calculate the 50th percentile of the horsepower of all cars. Which statistic is the 50th percentile equal to?**

In [36]:
#Write your answer here
print(np.percentile(data['horsepower'], 50))
np.median(data['horsepower'])

95.0


95.0

### Ideas for practicing further at home

* Find the engine size and horsepower of 4wd cars
* Find whether diesel or gas cars are more efficient in the city/highway
* Any other analysis that you think could generate some useful insight.
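
As a possible starting point for the fuel type comparison, the sketch below groups records by fuel type using boolean masks and compares mean mpg per group. The records are made up for illustration, so the printed numbers say nothing about the real dataset; only the field names mirror its columns.

```python
import numpy as np

# Hypothetical toy records (NOT from the dataset)
toy = np.array(
    [("gas", 24, 30), ("diesel", 38, 47), ("gas", 19, 25), ("diesel", 30, 33)],
    dtype=[("fueltype", "U10"), ("citympg", "i4"), ("highwaympg", "i4")],
)

# Mean city and highway mpg for each fuel type
means = {}
for fuel in np.unique(toy["fueltype"]):
    group = toy[toy["fueltype"] == fuel]
    means[str(fuel)] = (
        float(group["citympg"].mean()),
        float(group["highwaympg"].mean()),
    )
    print(fuel, "city:", means[str(fuel)][0], "highway:", means[str(fuel)][1])
```

The same pattern applies to the 4wd question: build a mask on `drivewheels` and then select the columns of interest from the masked array.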
