# Summarizing Automobile Evaluation Data

In this project, you'll use what you've learned about summarizing categorical data to analyze a sample from a popular open-source dataset. The full dataset contains information on the cost and physical attributes of over 1,700 cars. Originally, it was used to train a classification model that assigned an acceptability category to each car based on these attributes.

The car evaluation dataset comes from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, `manufacturer_country`, has been simulated for illustrative purposes. You can read more about the details, features, and original uses of this dataset on the [UCI data description page](https://archive.ics.uci.edu/ml/datasets/car+evaluation).

## Summarizing Manufacturing Country

1. `manufacturer_country` is a _nominal categorical variable_ that indicates the country of the manufacturer of each car reviewed. Create a table of frequencies of all the cars reviewed by `manufacturer_country`. What is the modal category? Which country appears 4th most frequently? Print out your results.

In [2]:
import pandas as pd

car_eval = pd.read_csv('car_eval_dataset.csv')
car_eval.manufacturer_country.value_counts()


manufacturer_country
Japan            228
Germany          218
South Korea      159
United States    138
Italy             97
France            87
China             73
Name: count, dtype: int64
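The two questions in this exercise can also be answered in code rather than by reading the table. The sketch below is a minimal illustration that rebuilds the frequency table from the counts printed above instead of reading `car_eval_dataset.csv`; the names `counts`, `modal_category`, and `fourth_most_frequent` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the frequency table printed above
# (stand-in for car_eval.manufacturer_country.value_counts()).
counts = pd.Series(
    {
        'Japan': 228,
        'Germany': 218,
        'South Korea': 159,
        'United States': 138,
        'Italy': 97,
        'France': 87,
        'China': 73,
    },
    name='count',
)

# value_counts() already sorts from most to least frequent;
# sorting here makes that assumption explicit.
counts = counts.sort_values(ascending=False)

# The modal category is the label with the largest count.
modal_category = counts.idxmax()

# Positions are zero-based, so index 3 is the 4th most frequent label.
fourth_most_frequent = counts.index[3]

print(modal_category)
print(fourth_most_frequent)
```

With the real data, the same two lines would work directly on the result of `car_eval.manufacturer_country.value_counts()`.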

2. Calculate a table of proportions for countries that appear in `manufacturer_country` in the dataset. What percentage of cars were manufactured in Japan?

In [3]:
proportions = car_eval.manufacturer_country.value_counts(normalize=True)
print(proportions)

manufacturer_country
Japan            0.228
Germany          0.218
South Korea      0.159
United States    0.138
Italy            0.097
France           0.087
China            0.073
Name: proportion, dtype: float64
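To answer the percentage question directly, one option is to index the proportions table by label and scale by 100. This is a small sketch that recomputes the proportions from the counts shown earlier rather than loading the CSV; `japan_pct` is an illustrative name.

```python
import pandas as pd

# Counts copied from the frequency table in exercise 1.
counts = pd.Series({
    'Japan': 228,
    'Germany': 218,
    'South Korea': 159,
    'United States': 138,
    'Italy': 97,
    'France': 87,
    'China': 73,
})

# Dividing by the total mirrors value_counts(normalize=True).
proportions = counts / counts.sum()

# Look up one category by label and convert the proportion to a percentage.
japan_pct = round(proportions['Japan'] * 100, 1)
print(f"{japan_pct}% of cars in the sample were manufactured in Japan")
```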


## Summarizing Buying Costs

3. `buying_cost` is a categorical variable that describes the cost of buying each car in the dataset. Print out a list of the possible values for this variable.

In [4]:
car_eval.buying_cost.unique()

array(['vhigh', 'med', 'low', 'high'], dtype=object)

4. `buying_cost` is an _ordinal categorical variable_, which means its values have a natural order. That order lets us rank and compare the values and compute order-based statistics such as the median, although arithmetic such as taking a mean is still not meaningful. Create a new list, `buying_cost_categories`, that contains the unique values in `buying_cost`, ordered from lowest to highest.

In [5]:
buying_cost_categories = ['low', 'med', 'high', 'vhigh']

5. Convert `buying_cost` to type `'category'` using the list you created in the previous exercise.

In [6]:
car_eval.buying_cost = pd.Categorical(car_eval.buying_cost, categories=buying_cost_categories, ordered=True)
car_eval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   buying_cost           1000 non-null   category
 1   maintenance_cost      1000 non-null   object  
 2   doors                 1000 non-null   object  
 3   capacity              1000 non-null   object  
 4   luggage               1000 non-null   object  
 5   safety                1000 non-null   object  
 6   acceptability         1000 non-null   object  
 7   manufacturer_country  1000 non-null   object  
dtypes: category(1), object(7)
memory usage: 56.0+ KB
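Once a column is an ordered categorical, pandas respects the low-to-high ordering in comparisons, sorting, and `min`/`max`. The sketch below illustrates this on a small made-up series (`toy_costs` is hypothetical data, not rows from the dataset).

```python
import pandas as pd

levels = ['low', 'med', 'high', 'vhigh']

# Hypothetical values, stored as an ordered categorical.
toy_costs = pd.Series(
    pd.Categorical(['vhigh', 'med', 'low', 'high', 'med'],
                   categories=levels, ordered=True)
)

# Comparisons use the category order, not alphabetical order.
at_least_high = int((toy_costs >= 'high').sum())

# min/max and sorting also follow the declared order.
lowest, highest = toy_costs.min(), toy_costs.max()
sorted_costs = toy_costs.sort_values().tolist()

print(at_least_high, lowest, highest)
print(sorted_costs)
```

Without `ordered=True`, the comparison `toy_costs >= 'high'` raises a `TypeError`, which is a useful guard against treating a nominal variable as ordinal.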


6. Calculate the median category of the `buying_cost` variable and print the result.

In [7]:
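# .cat.codes maps each value to its position in buying_cost_categories (0 = 'low', ..., 3 = 'vhigh')
median_index = car_eval.buying_cost.cat.codes.median()
# int() truncates a fractional median (e.g. 1.5) toward the lower category
median_category = car_eval.buying_cost.cat.categories[int(median_index)]
print(median_category)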

med
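The median-by-codes approach relies on every value having a valid code. pandas assigns the code `-1` to missing values, which would pull the median down if any were present. The `buying_cost` column here has no missing values, so the result above is unaffected, but a more defensive version might look like the following sketch. The data in `toy_costs` is made up to show the difference.

```python
import pandas as pd

levels = ['low', 'med', 'high', 'vhigh']

# Hypothetical data with three missing values.
toy_costs = pd.Series(
    pd.Categorical(
        ['med', 'high', 'high', 'vhigh', 'vhigh', None, None, None],
        categories=levels, ordered=True,
    )
)

codes = toy_costs.cat.codes

# Naive approach: missing values (code -1) are included in the median.
naive_category = toy_costs.cat.categories[int(codes.median())]

# Defensive approach: drop the -1 codes before taking the median.
valid_codes = codes[codes >= 0]
median_category = toy_costs.cat.categories[int(valid_codes.median())]

print(naive_category)
print(median_category)
```

When the number of valid values is even, the median of the codes can fall halfway between two categories. Truncating with `int()` picks the lower one, which is a convention rather than the only reasonable choice.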


## Summarizing Luggage Capacity

7. `luggage` is a categorical variable in the car evaluations dataset that records the luggage capacity for each reviewed car. Calculate a table of proportions for this variable and print the result.

In [10]:
car_eval.luggage.value_counts(normalize=True)

luggage
small    0.339
med      0.333
big      0.328
Name: proportion, dtype: float64

8. Are there any missing values in this column? Replicate the table of proportions from the previous exercise, but do not drop any missing values from the count. Print the result.

In [11]:
car_eval.luggage.value_counts(dropna=False, normalize=True)

luggage
small    0.339
med      0.333
big      0.328
Name: proportion, dtype: float64
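The table is unchanged, which indicates that `luggage` has no missing values. To see what `dropna=False` does when values are missing, here is a small sketch on made-up data (`toy_luggage` is hypothetical), along with `isna().sum()` as a direct check.

```python
import pandas as pd

# Hypothetical column with one missing value.
toy_luggage = pd.Series(['small', 'med', 'big', 'small', None])

# Direct count of missing values.
n_missing = int(toy_luggage.isna().sum())

# Default behavior: missing values are dropped before computing proportions.
without_missing = toy_luggage.value_counts(normalize=True)

# dropna=False keeps the missing values as their own row.
with_missing = toy_luggage.value_counts(dropna=False, normalize=True)

print(n_missing)
print(without_missing)
print(with_missing)
```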

9. Without passing `normalize=True` to `.value_counts()`, can you replicate the result you got in the previous exercise?

In [13]:
print(car_eval.luggage.value_counts()/len(car_eval.luggage))

luggage
small    0.339
med      0.333
big      0.328
Name: count, dtype: float64
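The choice of denominator matters when a column has missing values: `len()` counts every row, while `.count()` counts only non-missing ones. Because `luggage` has no missing values, both give the same answer here. The sketch below uses made-up data (`toy_luggage` is hypothetical) to show where they diverge.

```python
import pandas as pd

# Hypothetical column with one missing value.
toy_luggage = pd.Series(['small', 'med', 'big', 'small', None])

counts = toy_luggage.value_counts()

# len() includes the missing row, so proportions are out of all 5 rows.
by_len = counts / len(toy_luggage)

# .count() excludes the missing row, matching value_counts(normalize=True).
by_count = counts / toy_luggage.count()

normalized = toy_luggage.value_counts(normalize=True)

print(by_len)
print(by_count)
```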


## Summarizing Passenger Capacity

10. `doors` is a categorical variable in the car evaluations dataset that records the number of doors for each reviewed car. Find the count of cars that have 5 or more doors. You can identify cars with 5+ doors by looking for cars that have a value of `'5more'` in the `doors` column. Print your result.

In [21]:
print(car_eval.doors[car_eval.doors == '5more'].count())

246


11. Find the proportion of cars that have 5+ doors and print the result.

In [23]:
car_eval.doors.value_counts(normalize=True)

doors
4        0.263
3        0.252
5more    0.246
2        0.239
Name: proportion, dtype: float64
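The full table contains the answer (0.246 for `'5more'`), but a single proportion can also be computed from a boolean mask: summing the mask gives the count, and taking its mean gives the proportion. The sketch below rebuilds a stand-in `doors` column from the counts implied by the output above rather than loading the CSV; the variable names are illustrative.

```python
import pandas as pd

# Counts implied by the proportions table above (1000 rows in total).
door_counts = {'4': 263, '3': 252, '5more': 246, '2': 239}

# Expand the counts into one label per car (stand-in for car_eval.doors).
doors = pd.Series(
    [label for label, n in door_counts.items() for _ in range(n)]
)

# True for cars with 5 or more doors, False otherwise.
is_five_plus = doors == '5more'

# True counts as 1, so sum() is the count and mean() is the proportion.
n_five_plus = int(is_five_plus.sum())
prop_five_plus = float(is_five_plus.mean())

print(n_five_plus)
print(prop_five_plus)
```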