# Summarizing Automobile Evaluation Data

In this project, I use what I've learned about summarizing categorical data to analyze a sample from a popular open-source dataset. The dataset contains information on the cost and physical attributes of several thousand cars. It was originally used to train a classification model that assigned an acceptability category to each car based on these attributes.

The car evaluation dataset comes from the UCI Machine Learning Repository and has been slightly modified for this project: one additional field, `manufacturer_country`, has been simulated for illustrative purposes.

You can read more about the details, features, and original uses of this dataset on the [UCI data description page](https://archive.ics.uci.edu/ml/datasets/car+evaluation).

## Summarizing Manufacturing Country

In [1]:
import pandas as pd
import numpy as np

car_eval = pd.read_csv('car_eval_dataset.csv')
print(car_eval.head())
# value_counts() sorts by frequency (descending), so index[3] is the fourth most common country
car_eval.manufacturer_country.value_counts().index[3]

  buying_cost maintenance_cost doors capacity luggage safety acceptability  \
0       vhigh              low     4        4   small    med         unacc   
1       vhigh              med     3        4   small   high           acc   
2         med             high     3        2     med   high         unacc   
3         low              med     4     more     big    low         unacc   
4         low             high     2     more     med   high           acc   

  manufacturer_country  
0                China  
1               France  
2        United States  
3        United States  
4          South Korea  


'United States'

In [2]:
# Creating a table of proportions for countries
print(car_eval.manufacturer_country.value_counts(normalize=True))

Japan            0.228
Germany          0.218
South Korea      0.159
United States    0.138
Italy            0.097
France           0.087
China            0.073
Name: manufacturer_country, dtype: float64
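To make the link between raw counts and proportions explicit, here is a minimal sketch on a toy Series. The values in `countries` are made up for illustration and are not drawn from `car_eval_dataset.csv`. It shows that `normalize=True` divides each count by the number of non-missing entries, and that the first index label of the sorted counts is the most frequent category.

```python
import pandas as pd

# Hypothetical toy data, not taken from the car evaluation dataset
countries = pd.Series(["Japan", "Germany", "Japan", "Italy", "Japan", "Germany"])

# Raw frequencies, sorted from most to least common
counts = countries.value_counts()

# Same table expressed as proportions of the non-missing entries
proportions = countries.value_counts(normalize=True)

# Because the result is sorted by frequency, the first label is the mode
most_common = counts.index[0]

print(counts)
print(proportions)
print(most_common)
```

The same indexing idea is what `index[3]` relies on in the first cell: position 3 in the sorted table is the fourth most frequent country.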


## Summarizing Buying Costs

In [3]:
# Get list of categorical values for buying_cost
car_eval.buying_cost.unique()

array(['vhigh', 'med', 'low', 'high'], dtype=object)

In [4]:
# Define the buying_cost categories in order from lowest to highest
buying_cost_categories = ['low', 'med', 'high', 'vhigh']

In [5]:
# Converting buying_cost to category type column using ordered values
car_eval.buying_cost = pd.Categorical(car_eval.buying_cost, buying_cost_categories, ordered=True)

In [6]:
# Calculate median category of buying_cost
bc_median = np.median(car_eval.buying_cost.cat.codes)
print(bc_median)

1.0


In [8]:
# Map the median code back to its category label
bc_median_cat = buying_cost_categories[int(bc_median)]
print(bc_median_cat)

med
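The median steps above can be sketched end to end on a small example. The `costs` values below are hypothetical and only stand in for the `buying_cost` column. The sketch assumes there are no missing values; pandas encodes a missing value as code `-1`, which would pull the median down if it were left in.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the buying_cost column
costs = pd.Series(["vhigh", "low", "med", "high", "low", "med", "low"])
order = ["low", "med", "high", "vhigh"]

# An ordered Categorical stores each value as an integer code:
# low=0, med=1, high=2, vhigh=3 (the position in `order`)
costs_cat = pd.Categorical(costs, categories=order, ordered=True)
codes = costs_cat.codes

# The median of the codes is the position of the median category
median_code = np.median(codes)

# With an even number of rows the median can fall between two codes (e.g. 1.5).
# int() then truncates to the lower category, which is a choice rather than a rule.
median_category = order[int(median_code)]

print(median_code)
print(median_category)
```

A mean of the codes would not be meaningful here, because the spacing between ordinal categories is unknown; the median only depends on their order.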


## Summarizing Luggage Capacity

In [10]:
# Proportion of cars in each luggage category (missing values excluded by default)
print(car_eval.luggage.value_counts(normalize=True))

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64


In [11]:
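# Same table with missing values counted as their own category; identical output means none are missing
print(car_eval.luggage.value_counts(dropna=False, normalize=True))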

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
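Since the two tables above are identical, the `luggage` column appears to have no missing values. To show what `dropna=False` would change if it did, here is a small sketch on made-up data that contains one `NaN`. The Series below is hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry
luggage = pd.Series(["small", "med", np.nan, "big", "small"])

# Default: NaN is dropped, so proportions are out of the 4 non-missing rows
without_nan = luggage.value_counts(normalize=True)

# dropna=False: NaN becomes its own row, so proportions are out of all 5 rows
with_nan = luggage.value_counts(dropna=False, normalize=True)

print(without_nan)
print(with_nan)
```

Comparing the two tables is a quick way to check how much of a column is missing before summarizing it.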


The same proportions can be reproduced without passing `normalize=True` to `.value_counts()` by dividing the raw counts by the number of rows:

In [12]:
print(car_eval.luggage.value_counts()/len(car_eval.luggage))

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64
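One caveat with the manual approach is the choice of denominator. The sketch below uses a hypothetical Series with one missing value to show that dividing by `len()` counts the missing row, while dividing by `.count()` matches what `normalize=True` does by default. With no missing values, as seems to be the case for `luggage`, the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry
luggage = pd.Series(["small", "med", np.nan, "big", "small"])

# len() includes the NaN row, so these proportions sum to less than 1
by_len = luggage.value_counts() / len(luggage)

# count() ignores NaN, so this matches normalize=True
by_count = luggage.value_counts() / luggage.count()
normalized = luggage.value_counts(normalize=True)

print(by_len)
print(by_count)
print(normalized)
```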


## Summarizing Passenger Capacity

In [13]:
# Proportion of cars by number of doors
print(car_eval.doors.value_counts(normalize=True))

4        0.263
3        0.252
5more    0.246
2        0.239
Name: doors, dtype: float64
