# Summarizing Automobile Evaluation Data
In the following project, we’ll use what we’ve learned about summarizing categorical data to analyze a sample from a popular open source dataset. This dataset contains information on the cost and physical attributes of several thousand cars. Originally, this dataset was used to train a classification model that assigned an acceptability score/category to cars based on these attributes.

The car evaluation dataset has been sourced from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, *manufacturer_country*, has been simulated for illustrative purposes. You can read more about the details, features, and original uses of this dataset on the UCI data description page.

In [2]:
import pandas as pd
import numpy as np

In [3]:
#Loading data 
car_eval = pd.read_csv('car_eval_dataset.csv')
#Checking the head of the dataset
print(car_eval.head())

  buying_cost maintenance_cost doors capacity luggage safety acceptability  \
0       vhigh              low     4        4   small    med         unacc   
1       vhigh              med     3        4   small   high           acc   
2         med             high     3        2     med   high         unacc   
3         low              med     4     more     big    low         unacc   
4         low             high     2     more     med   high           acc   

  manufacturer_country  
0                China  
1               France  
2        United States  
3        United States  
4          South Korea  



## Summarizing Manufacturer Country
*manufacturer_country* is a nominal categorical variable that indicates the country of the manufacturer of each car reviewed. Let's create a table of frequencies of all the cars reviewed by manufacturer_country. What is the modal category? Which country appears 4th most frequently? 

In [4]:
manufacturer_country_frequency = car_eval.manufacturer_country.value_counts()
print(manufacturer_country_frequency)

Japan            228
Germany          218
South Korea      159
United States    138
Italy             97
France            87
China             73
Name: manufacturer_country, dtype: int64


**Results:** The modal category is Japan, and the 4th most frequent manufacturer country is the United States.
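Because `value_counts()` sorts from most to least frequent by default, the modal category and the n-th most frequent category can also be read off programmatically. The sketch below uses a small made-up Series (the name `countries` and its values are assumptions standing in for the real column), so the numbers do not match the dataset.

```python
import pandas as pd

# Toy stand-in for car_eval.manufacturer_country (values are made up)
countries = pd.Series(
    ["Japan"] * 4 + ["Germany"] * 3 + ["Italy"] * 2 + ["France"]
)

# value_counts() returns counts sorted in descending order by default
counts = countries.value_counts()

# The modal category is the label with the largest count
modal_category = counts.idxmax()

# The 4th most frequent category sits at position 3 of the sorted index
fourth_most_frequent = counts.index[3]

print(modal_category)
print(fourth_most_frequent)
```

Note that when two categories have the same count, their relative order in the table is not meaningful, so a positional lookup like this is only safe when the counts are distinct.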

Let's calculate a table of proportions for countries that appear in manufacturer_country in the dataset. What percentage of cars were manufactured in Japan?



In [5]:
manufacturer_country_proportions = car_eval.manufacturer_country.value_counts(normalize = True)
print(manufacturer_country_proportions)

Japan            0.228
Germany          0.218
South Korea      0.159
United States    0.138
Italy            0.097
France           0.087
China            0.073
Name: manufacturer_country, dtype: float64


**Results:** 22.8% of the cars were manufactured in Japan.
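As a quick check on what `normalize = True` does, the following sketch computes the same proportions by hand, dividing each count by the total number of counted rows. It again uses a made-up Series (`countries` is an assumed stand-in for the real column), so the figures are illustrative only.

```python
import pandas as pd

# Toy stand-in for car_eval.manufacturer_country (values are made up)
countries = pd.Series(
    ["Japan"] * 4 + ["Germany"] * 3 + ["Italy"] * 2 + ["France"]
)

# Proportions computed directly by pandas
proportions = countries.value_counts(normalize=True)

# The same proportions computed manually: count / total of all counts
counts = countries.value_counts()
manual_proportions = counts / counts.sum()

# Convert a single proportion into a percentage
japan_percent = round(proportions["Japan"] * 100, 1)

print(manual_proportions)
print(japan_percent)
```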

## Summarizing Buying Costs

*buying_cost* is a categorical variable which describes the cost of buying any car in the dataset. Let's check out a list of the possible values for this variable.

In [6]:
print(car_eval.buying_cost.unique())

['vhigh' 'med' 'low' 'high']


*buying_cost* is an ordinal categorical variable, which means its values have a natural order. Once that order is made explicit, we can compute order-based summaries such as the median. We will create a new list, *buying_cost_categories*, that contains the unique values in buying_cost, ordered from lowest to highest.

In [7]:
buying_cost_categories = ['low', 'med', 'high', 'vhigh']

In [8]:
# Converting the buying cost column into a Categorical column

car_eval.buying_cost = pd.Categorical(car_eval.buying_cost, buying_cost_categories, ordered = True)
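To illustrate what the ordered conversion makes possible, here is a minimal sketch on a made-up Series (`costs` and its values are assumptions, not rows from the dataset). With `ordered = True`, pandas allows order-aware comparisons and `min`/`max`, which would not be meaningful for a nominal variable.

```python
import pandas as pd

levels = ["low", "med", "high", "vhigh"]

# Toy stand-in for car_eval.buying_cost (values are made up)
costs = pd.Series(
    pd.Categorical(
        ["vhigh", "low", "med", "high", "low"],
        categories=levels,
        ordered=True,
    )
)

# Comparisons follow the declared category order, not alphabetical order
above_med = costs > "med"
n_above_med = int(above_med.sum())

# min and max are defined for ordered categoricals
cheapest = costs.min()
priciest = costs.max()

print(n_above_med, cheapest, priciest)
```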

In [9]:
# Calculating the median category: take the median of the integer category codes, then map that code back to its label
buying_cost_index = np.median(car_eval.buying_cost.cat.codes)
print(buying_cost_index)
buying_cost_median = buying_cost_categories[int(buying_cost_index)]
print(buying_cost_median)

1.0
med


**Results:** The median buying cost category is med.
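The median calculation above relies on `cat.codes`, which maps each label to its position in the ordered category list. The sketch below walks through the same steps on a made-up Series of five values (`costs` is an assumed stand-in for the real column) so that each intermediate result is easy to follow.

```python
import numpy as np
import pandas as pd

levels = ["low", "med", "high", "vhigh"]

# Toy stand-in for car_eval.buying_cost (values are made up)
costs = pd.Series(
    pd.Categorical(
        ["high", "low", "vhigh", "med", "low"],
        categories=levels,
        ordered=True,
    )
)

# Each label becomes its index in `levels`: low=0, med=1, high=2, vhigh=3
codes = costs.cat.codes

# Median of the integer codes (sorted codes here are 0, 0, 1, 2, 3)
median_code = np.median(codes)

# Map the median code back to its label
median_label = levels[int(median_code)]

print(list(codes), median_code, median_label)
```

Two caveats apply to this approach. With an even number of rows, the median can fall halfway between two codes (for example 1.5), in which case `int()` truncates toward the lower category. Missing values are encoded as -1 in `cat.codes`, so they should be handled before taking the median.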

## Summarizing Luggage Capacity

*luggage* is a categorical variable in the car evaluations dataset that records the luggage capacity for each reviewed car. Let's calculate a table of proportions for this variable and print the result.

In [10]:
luggage_proportions = car_eval.luggage.value_counts(dropna = False, normalize = True)
print(luggage_proportions)

small    0.339
med      0.333
big      0.328
Name: luggage, dtype: float64


**Results:** The three luggage capacity categories (small, med, and big) are almost evenly represented, each accounting for roughly a third of the cars.
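The `dropna = False` argument used above matters only when the column contains missing values. The output above lists no missing category, but the sketch below shows the difference on a made-up Series that does contain one (`luggage` and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_eval.luggage with one missing value (values are made up)
luggage = pd.Series(["small", "med", "big", "small", np.nan])

# Missing values get their own row, and proportions use all 5 rows
with_missing = luggage.value_counts(dropna=False, normalize=True)

# Default behavior: missing values are dropped, proportions use 4 rows
without_missing = luggage.value_counts(normalize=True)

# Share of rows that are missing, read from the first table
missing_share = with_missing[with_missing.index.isna()].sum()

print(with_missing)
print(without_missing)
print(missing_share)
```

Keeping the missing values in the table makes the proportions describe the whole sample rather than only the rows with a recorded value.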

## Summarizing Number of Doors
*doors* is a categorical variable in the car evaluation dataset that records the number of doors for each reviewed car. Let's find the number and the proportion of cars that have 5 or more doors. We can identify cars with 5+ doors by looking for cars that have a value of '5more' in the doors column.

In [11]:
frequency = np.sum(car_eval["doors"] == '5more')
print(frequency)
proportion = np.mean(car_eval["doors"] == '5more')
print(proportion)

246
0.246


**Results:** 246 cars, or 24.6% of the sample, have 5 or more doors.
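The count and proportion above both come from the same boolean mask: `True` is treated as 1 and `False` as 0, so the sum is the number of matches and the mean is the share of matches. A minimal sketch on a made-up Series (`doors` and its values are assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for car_eval["doors"] (values are made up)
doors = pd.Series(["2", "3", "4", "5more", "5more", "4", "2", "5more"])

# Boolean mask: True where the car has 5 or more doors
has_five_plus = doors == "5more"

# Sum counts the True values; mean gives their share of all rows
frequency = int(has_five_plus.sum())
proportion = float(has_five_plus.mean())

print(frequency)
print(proportion)
```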