## Project: Summarizing Automobile Evaluation Data

This project is part of the course 'Master Statistics With Python' from Codecademy (www.codecademy.com).

In the following project we’ll analyze a sample from a popular open-source dataset. This dataset contains information on the cost and physical attributes of several thousand cars. Originally, this dataset was used to train a classification model that assigned an acceptability score/category to cars based on these attributes.

The car evaluation dataset has been sourced from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, manufacturer_country, has been simulated for illustrative purposes. You can read more about the details, features, and original uses of this dataset in research on the UCI data description page (https://archive.ics.uci.edu/ml/datasets/car+evaluation).

In [2]:
import pandas as pd
import numpy as np

car_eval = pd.read_csv('car_eval_dataset.csv')
print(car_eval.head())

  buying_cost maintenance_cost doors capacity luggage safety acceptability  \
0       vhigh              low     4        4   small    med         unacc   
1       vhigh              med     3        4   small   high           acc   
2         med             high     3        2     med   high         unacc   
3         low              med     4     more     big    low         unacc   
4         low             high     2     more     med   high           acc   

  manufacturer_country  
0                China  
1               France  
2        United States  
3        United States  
4          South Korea  


## Summarizing Manufacturing Country

In [3]:
#Creating a table of frequencies of all cars reviewed by manufacturer_country:

table_of_frequencies = car_eval['manufacturer_country'].value_counts()
print(table_of_frequencies)

Japan            228
Germany          218
South Korea      159
United States    138
Italy             97
France            87
China             73
Name: manufacturer_country, dtype: int64


In [4]:
#Calculating a table of proportions (expressed as percentages) for the countries that appear in manufacturer_country:

table_of_proportions = (car_eval['manufacturer_country'].value_counts(normalize = True))*100
print(table_of_proportions)

Japan            22.8
Germany          21.8
South Korea      15.9
United States    13.8
Italy             9.7
France            8.7
China             7.3
Name: manufacturer_country, dtype: float64


As we can see in the results above, 22.8% of cars were manufactured in Japan, 21.8% in Germany, 15.9% in South Korea, 13.8% in the United States, 9.7% in Italy, 8.7% in France, and 7.3% in China.
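The link between the frequency table and the percentage table can be checked on a small sample. The sketch below uses a made-up `countries` Series (hypothetical values, not rows from the dataset) and shows that dividing each count by the number of rows gives the same result as `normalize=True`.

```python
import pandas as pd

# Hypothetical toy sample, not taken from car_eval_dataset.csv
countries = pd.Series(['Japan', 'Japan', 'Germany', 'Japan', 'Germany', 'Italy'])

counts = countries.value_counts()
percentages = countries.value_counts(normalize=True) * 100

# Dividing the counts by the number of rows reproduces the normalized table
manual_percentages = counts / len(countries) * 100

print(percentages.round(1))
print(manual_percentages.round(1))
```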

## Summarizing Buying Costs

In [5]:
#Analyzing the column buying_cost (which describes the cost of buying each car in the dataset). Let's inspect the unique values for this variable:

print(list(car_eval['buying_cost'].unique()))

['vhigh', 'med', 'low', 'high']


As we can see above, buying_cost takes four values with a natural order (low < med < high < vhigh), so it is an ordinal categorical variable. Once that order is encoded in the data, we can compute order-based statistics such as the median.

In [6]:
#Ordering from lowest to highest:
buying_cost_categories = ['low', 'med', 'high', 'vhigh']

#Converting to type 'category':
car_eval['buying_cost'] = pd.Categorical(car_eval['buying_cost'], buying_cost_categories, ordered = True)

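#Calculating the median (cat.codes maps low=0, med=1, high=2, vhigh=3):
median_index = np.median(car_eval['buying_cost'].cat.codes)
#int() truncates, so a median code ending in .5 would be reported as the lower category:
median_category = buying_cost_categories[int(median_index)]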
print(median_category)

med


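As we can see, the median buying cost in this dataset is 'med': at least half of the cars have a buying cost of 'med' or lower, and at least half have 'med' or higher. Note that the median is the middle value, not necessarily the most frequent one (that would be the mode).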
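One detail worth keeping in mind: with an even number of rows, `np.median` averages the two middle codes and can return a value ending in .5, which does not correspond to a single category. The sketch below uses a hypothetical four-row sample (not the dataset) chosen to trigger that case, and reports both neighbouring categories instead of truncating.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample, not taken from car_eval_dataset.csv
categories = ['low', 'med', 'high', 'vhigh']
costs = pd.Series(pd.Categorical(
    ['vhigh', 'low', 'med', 'high'],
    categories=categories, ordered=True))

codes = costs.cat.codes          # low=0, med=1, high=2, vhigh=3
median_code = np.median(codes)   # average of the two middle codes here

# Report both neighbouring categories so a .5 median is not silently truncated
lower = categories[int(np.floor(median_code))]
upper = categories[int(np.ceil(median_code))]
print(median_code, lower, upper)
```

When `lower` and `upper` agree, the median is a single category; when they differ, the median falls between the two.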

## Summarizing Luggage Capacity

In [9]:
#The column luggage is a categorical variable that records the luggage capacity for each reviewed car. Let's calculate a table of proportions for this variable:

table_of_proportions_luggage = (car_eval['luggage'].value_counts(normalize = True))*100
print(table_of_proportions_luggage)

small    33.9
med      33.3
big      32.8
Name: luggage, dtype: float64


33.9% of cars have small luggage capacity, 33.3% have medium luggage capacity, and 32.8% have big luggage capacity.

In [8]:
#Analyzing if there are any missing values in the luggage column:

print(car_eval['luggage'].value_counts(dropna = False, normalize = True))

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64


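With dropna = False, missing values would show up as a separate NaN row in the table. No such row appears and the proportions match the earlier table (0.339 is 33.9%, and so on), so there are no missing values in the luggage column.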
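To see what a missing value would look like in this check, here is a minimal sketch on a made-up `luggage` Series containing one NaN (hypothetical data, not the dataset). It also shows `isna().sum()` as a more direct way to count missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample with one missing value, not taken from the dataset
luggage = pd.Series(['small', 'med', np.nan, 'big', 'small'])

# dropna=False keeps NaN as its own row in the table
with_missing = luggage.value_counts(dropna=False, normalize=True)

# Counting missing entries directly is a more explicit check
n_missing = luggage.isna().sum()

print(with_missing)
print(n_missing)
```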

## Summarizing Passenger Capacity

In [10]:
#The column doors is a categorical variable that records the count of doors for each reviewed car.
#Let's find the count of cars that have 5 or more doors:

frequency = np.sum(car_eval.doors == '5more')
print(frequency)

246


There are 246 cars in this dataset that have 5 or more doors.

In [11]:
#Proportion of cars that have 5+ doors:
proportion = (car_eval.doors == '5more').mean()
print(proportion)

0.246


24.6% of all cars in this dataset have 5 or more doors.
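Both results come from the same boolean comparison: True counts as 1 and False as 0, so the sum is a frequency and the mean is a proportion. A minimal sketch on a made-up `doors` Series (hypothetical values, not the dataset):

```python
import pandas as pd

# Hypothetical toy sample, not taken from car_eval_dataset.csv
doors = pd.Series(['2', '3', '4', '5more', '5more', '4', '2', '5more'])

is_five_plus = doors == '5more'      # boolean Series: True counts as 1, False as 0
count = is_five_plus.sum()           # number of True values
proportion = is_five_plus.mean()     # same as count / len(doors)

print(count, proportion)
```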