In the following project you’ll use what you’ve learned about summarizing categorical data to analyze a sample from a popular open-source dataset. This dataset contains information on the cost and physical attributes of several thousand cars. Originally, this dataset was used to train a classification model that assigned an acceptability category to each car based on these attributes.

The car evaluation dataset has been sourced from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, manufacturer_country, has been simulated for illustrative purposes.

In [1]:
import pandas as pd
import numpy as np

In [2]:
car_eval = pd.read_csv('/Users/elorm/Documents/Repos/Datasets/car_eval_dataset.csv')
print(car_eval.head())

  buying_cost maintenance_cost doors capacity luggage safety acceptability  \
0       vhigh              low     4        4   small    med         unacc   
1       vhigh              med     3        4   small   high           acc   
2         med             high     3        2     med   high         unacc   
3         low              med     4     more     big    low         unacc   
4         low             high     2     more     med   high           acc   

  manufacturer_country  
0                China  
1               France  
2        United States  
3        United States  
4          South Korea  


In [3]:
# value_counts() sorts by frequency, so the first index label is the modal category
manufacturer_counts = car_eval['manufacturer_country'].value_counts()
modal_manufacturer_country = manufacturer_counts.index[0]
print(modal_manufacturer_country)


Japan


In [4]:
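# Count for the modal category; .iloc[0] selects by position (the index labels are country names)
car_eval['manufacturer_country'].value_counts().iloc[0]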

228
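
Note that `value_counts().index[0]` reports a single category even when two categories are tied for the highest count. The sketch below, which uses made-up toy data rather than the car evaluation dataset, shows one way to surface ties with `Series.mode()`; the names `toy_countries` and `toy_modes` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the car evaluation dataset.
toy_countries = pd.Series(['Japan', 'Germany', 'Japan', 'Germany', 'Italy'])

# mode() returns every category that shares the highest count, sorted by label.
toy_modes = toy_countries.mode()
print(toy_modes.tolist())

# value_counts() still gives the count attached to the top category.
top_count = toy_countries.value_counts().iloc[0]
print(top_count)
```

If `mode()` returns more than one label, the variable is multimodal and reporting only the first label would hide that.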

In [5]:
# Print the table of proportions
table_of_proportions = car_eval['manufacturer_country'].value_counts(normalize=True)
print(table_of_proportions)

Japan            0.228
Germany          0.218
South Korea      0.159
United States    0.138
Italy            0.097
France           0.087
China            0.073
Name: manufacturer_country, dtype: float64


In [6]:
#Looking at all the unique buying cost categories
print(car_eval['buying_cost'].unique())

['vhigh' 'med' 'low' 'high']


In [7]:
# Order the categories from cheapest to most expensive
buying_cost_order = ['low', 'med', 'high', 'vhigh']

car_eval['buying_cost'] = pd.Categorical(car_eval['buying_cost'], categories=buying_cost_order, ordered=True)

In [8]:
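# cat.codes maps each category to its position in buying_cost_order (0 = 'low', ..., 3 = 'vhigh')
median_category_num = np.median(car_eval['buying_cost'].cat.codes)
# int() truncates, so a median that falls between two codes resolves to the lower category
median_category = buying_cost_order[int(median_category_num)]
print(median_category)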

med
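
The median lookup above relies on two details of `cat.codes`: missing values are coded as `-1`, and `np.median` can return a value ending in `.5` when the number of observations is even. The following is a minimal sketch of one way to handle both cases, using hypothetical toy data (`toy`, `order`) rather than the car evaluation dataset, and choosing the lower of the two middle categories as a convention.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from the car evaluation dataset.
order = ['low', 'med', 'high', 'vhigh']
toy = pd.Series(pd.Categorical(['low', 'vhigh', 'med', 'high', None],
                               categories=order, ordered=True))

codes = toy.cat.codes                       # missing values are coded as -1
valid_codes = codes[codes >= 0].to_numpy()  # drop the -1 placeholders

# Four valid values remain, so the median falls between two codes.
raw_median = np.median(valid_codes)

# Assumed convention: take the lower of the two middle categories.
toy_median_category = order[int(np.floor(raw_median))]
print(raw_median, toy_median_category)
```

Leaving the `-1` codes in would pull the median toward the lowest category, so filtering them first is the safer default whenever a column might contain missing values.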


In [9]:
# dropna=False keeps missing values (NaN) as their own row in the table
luggage_proportions = car_eval['luggage'].value_counts(dropna=False, normalize=True)
print(luggage_proportions)

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64


In [10]:
# Default behavior (dropna=True) excludes missing values before computing proportions
luggage_proportions = car_eval['luggage'].value_counts(normalize=True)
print(luggage_proportions)

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
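
The two luggage tables are identical, which is consistent with the `luggage` column having no missing values: with `dropna=False`, any `NaN` entries would appear as an extra row and shrink the other proportions. A small sketch of what that difference looks like, using a hypothetical toy Series (`toy_luggage`) rather than the real data:

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the car evaluation dataset.
toy_luggage = pd.Series(['small', 'med', None, 'big', 'small'])

# Proportions out of all 5 rows; NaN gets its own row.
with_missing = toy_luggage.value_counts(dropna=False, normalize=True)

# Proportions out of the 4 non-missing rows only.
without_missing = toy_luggage.value_counts(normalize=True)

# A direct count of missing values is a quick cross-check.
n_missing = toy_luggage.isna().sum()

print(with_missing)
print(without_missing)
print(n_missing)
```

Which denominator is appropriate depends on the question being asked, so it helps to state whether reported proportions include or exclude missing values.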


In [11]:
number_of_doors = car_eval['doors'].value_counts()
print(number_of_doors)

4        263
3        252
5more    246
2        239
Name: doors, dtype: int64
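
Because `doors` includes the label `5more`, it is stored as text, but it still has a natural order. As with `buying_cost`, treating it as an ordered categorical allows threshold-style summaries. The sketch below assumes the four labels shown above and uses hypothetical toy data (`toy_doors`), not the real column.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the car evaluation dataset.
door_order = ['2', '3', '4', '5more']
toy_doors = pd.Series(pd.Categorical(['2', '4', '5more', '3', '4'],
                                     categories=door_order, ordered=True))

# Ordered categoricals support comparisons against a category label,
# so the mean of the boolean mask is the share of cars with 4 or more doors.
share_four_plus = (toy_doors >= '4').mean()
print(share_four_plus)
```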
