## Automobile Dataset (UCI Machine Learning Repository)

This week, we'll do an exploratory data analysis of the [automobile dataset from UCI](https://archive.ics.uci.edu/dataset/10/automobile). Follow the link to learn more about the dataset!

Break off into groups of two or three and complete the following.

### Step 0: Establishing the question

This data analysis is *exploratory*, so you might not initially have a question in mind. In this case, we want to characterize what we have in the most comprehensive way we can.

### Step 1: Environment Setup

Once you've cloned this repository, I highly recommend setting up a virtual environment for running the code. To do that, install Python, then create a new virtual environment with a command like `python -m venv virtualenv`, which creates a virtual environment called `virtualenv`. Make sure you select that virtual environment as the kernel for your Jupyter Notebook session (top right of the window). Then, you can run the next cell.
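If you want to double-check that the notebook kernel really is running from the virtual environment, a quick check like the one below can help. This is only a sketch: it relies on the fact that, for environments created with `venv`, `sys.prefix` differs from `sys.base_prefix`. Other environment managers (conda, for example) may not follow this rule, and the variable name `in_virtualenv` is just an illustration.

```python
import sys

# Inside an environment created with `python -m venv`, sys.prefix points at the
# environment, while sys.base_prefix points at the Python it was created from.
in_virtualenv = sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv)
```

If the second line prints `False`, the kernel is probably using your system Python rather than the environment you created.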

In [29]:
# Run this cell to install the required packages:
!pip install pandas numpy matplotlib



In [None]:
# Run this cell to import the necessary libraries for use in this notebook:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Step 1: Reading the Data

In this step you will use the Python package pandas to read in your data. After reading the data, we'll do a few standard operations that are typically done as sanity checks and initial data characterizations.
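As a rough sketch of what reading and sanity-checking looks like, the example below loads a tiny inline table instead of the real file. The sample rows, the name `sample_df`, and the use of `na_values="?"` are assumptions made for illustration; the idea is that a `?` placeholder can be turned into a proper missing value at read time.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real file. With your own data you would pass
# the path to your .csv file to pd.read_csv instead of `sample_csv`.
sample_csv = io.StringIO(
    "symboling,normalized-losses,make,fuel-type\n"
    "3,?,alfa-romero,gas\n"
    "1,?,alfa-romero,gas\n"
    "2,164,audi,gas\n"
)

# na_values="?" asks pandas to treat "?" as a missing value (NaN), not as text.
sample_df = pd.read_csv(sample_csv, na_values="?")

print(sample_df.head())    # first rows (up to 5)
print(sample_df.columns)   # column labels
print(sample_df.shape)     # (number of rows, number of columns)
```

Without `na_values`, a column that contains `?` is read as text, so whether you need that argument depends on how missing values are written in your file.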

In [None]:
# In this cell, define the path to the .csv file, and load it into a pandas DataFrame called `df`.


In [None]:
# Print the first 5 rows of the DataFrame to verify that the data has been loaded correctly (using the `.head()` method).


   symboling normalized-losses         make fuel-type aspiration num-of-door  \
0          3                 ?  alfa-romero       gas        std         two   
1          3                 ?  alfa-romero       gas        std         two   
2          1                 ?  alfa-romero       gas        std         two   
3          2               164         audi       gas        std        four   
4          2               164         audi       gas        std        four   

    body-style drive-wheels engine-location  wheel-base  ...  engine-size  \
0  convertible          rwd           front        88.6  ...          130   
1  convertible          rwd           front        88.6  ...          130   
2    hatchback          rwd           front        94.5  ...          152   
3        sedan          fwd           front        99.8  ...          109   
4        sedan          4wd           front        99.4  ...          136   

   fuel-system  bore  stroke compression-ratio horsepowe

In [None]:
# List the column names of the DataFrame (using the `.columns` attribute).


Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-door', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')


In [30]:
# Print the number of rows and columns in the DataFrame (you can use the `.shape` attribute).


### Step 1: Data Cleaning

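You will notice that a few of the columns contain missing values, which appear as `?` in the output above (for example, in `normalized-losses`).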
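One thing to watch for when counting missing values: the printed rows above show `?` entries in `normalized-losses`, and pandas treats a literal `?` as ordinary text. The sketch below uses a made-up toy table (the values and the `threshold` are hypothetical) to show the placeholder being converted to `NaN`, the missing values being counted, and columns above a threshold being dropped.

```python
import numpy as np
import pandas as pd

# Made-up toy table; "?" marks a missing entry.
toy = pd.DataFrame(
    {
        "make": ["audi", "bmw", "audi", "bmw", "saab"],
        "normalized-losses": ["164", "?", "?", "?", "150"],
        "price": ["13950", "16430", "?", "20970", "11850"],
    }
)

# A literal "?" is just a string, so isnull() does not flag it yet.
print(toy.isnull().sum())

# Convert the placeholder to NaN so pandas recognises it as missing.
toy = toy.replace("?", np.nan)
missing_counts = toy.isnull().sum()
print(missing_counts)

# Keep only the columns whose missing count does not exceed the threshold.
threshold = 2  # hypothetical value chosen for this toy table
cleaned = toy.loc[:, missing_counts <= threshold]
print(cleaned.columns.tolist())
```

The same pattern applies to the real DataFrame with a larger threshold. If the placeholder was already converted when the file was read, the `replace` step is unnecessary.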

In [None]:
# Count the number of missing values in each column (using the `.isnull().sum()` method).

In [None]:
# Drop the columns with more than 40 missing values

### Step 2: Data Exploration & Basic Statistics

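Often we want to know specific facts about the data, such as the average or maximum of a column, how many distinct values it contains, or what type it is stored as.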
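The summary methods involved can be sketched on a small made-up table (the numbers below are invented, not taken from the dataset). The sketch assumes that a numeric column may have been read as text because of a `?` placeholder, in which case it has to be converted before any arithmetic.

```python
import pandas as pd

# Made-up toy table for illustration only.
toy = pd.DataFrame(
    {
        "make": ["audi", "audi", "bmw", "saab"],
        "wheel-base": [99.8, 99.4, 101.2, 99.1],
        "horsepower": ["102", "115", "?", "110"],
    }
)

# errors="coerce" turns anything that is not a number (here "?") into NaN.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

mean_hp = toy["horsepower"].mean()   # average; NaN values are skipped
std_hp = toy["horsepower"].std()     # sample standard deviation (ddof=1)
max_hp = toy["horsepower"].max()     # largest value
n_makes = toy["make"].nunique()      # number of distinct makes

print(mean_hp, std_hp, max_hp, n_makes)
print(toy["wheel-base"].dtype)       # data type of the column
```

If a call such as `.mean()` raises an error or returns something unexpected on the real data, checking the column's `dtype` is usually a good first step.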

In [None]:
# Print the average and standard deviation of the price of the cars in the dataset.

In [None]:
# Print the maximum horsepower of the cars in the dataset.

In [None]:
# Print the number of car makes in the dataset.

In [None]:
# Print the data type (dtype) of the 'wheel-base' column.

Data Questions
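Questions of the form "which group has the highest average?" are usually answered with a group-by followed by an aggregation. The sketch below shows the pattern on invented numbers; the table and the names `mean_price_by_make` and `top_make` are illustrative only.

```python
import pandas as pd

# Made-up toy table for illustration only.
toy = pd.DataFrame(
    {
        "make": ["audi", "audi", "bmw", "bmw", "saab"],
        "price": [14000.0, 18000.0, 21000.0, 25000.0, 12000.0],
    }
)

# Average price per make, as a Series indexed by make.
mean_price_by_make = toy.groupby("make")["price"].mean()
print(mean_price_by_make.sort_values(ascending=False))

# idxmax() returns the index label (the make) with the largest average.
top_make = mean_price_by_make.idxmax()
print(top_make)
```

Sorting the grouped result before picking the top entry also lets you see how close the runners-up are.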

In [None]:
# Which car make has the highest average price?

# Which car make has the highest average horsepower?

### Step 3: Data Visualization

Here we'll visualize the distribution of individual columns and the relationships between pairs of columns.
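As a minimal sketch of the two plot types used in this step, the example below draws a histogram and a scatter plot side by side with matplotlib. The data values are made up for illustration, and the figure layout (two panels, five bins) is just one reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy table for illustration only.
toy = pd.DataFrame(
    {
        "engine-size": [97, 109, 130, 136, 152, 164],
        "price": [7000.0, 13950.0, 13495.0, 17450.0, 16500.0, 21000.0],
    }
)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: how the values of a single column are distributed.
ax_hist.hist(toy["price"], bins=5)
ax_hist.set_xlabel("price")
ax_hist.set_ylabel("number of cars")
ax_hist.set_title("Distribution of price")

# Scatter plot: how two columns vary together.
ax_scatter.scatter(toy["engine-size"], toy["price"])
ax_scatter.set_xlabel("engine-size")
ax_scatter.set_ylabel("price")
ax_scatter.set_title("price vs. engine-size")

fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a plain script you would call `plt.show()` at the end. Labelling the axes on every plot makes it much easier to compare figures later.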

In [12]:
# Plot a histogram of the 'price' column to visualize the distribution of car prices.

In [None]:
# Make a scatter plot to visualize the relationship between 'horsepower' and 'price'.


In [28]:
# Make a scatter plot to visualize the relationship between 'engine-size' and 'price'.


In [27]:
# Make a scatter plot to visualize the relationship between 'peak-rpm' and 'engine-size'.


In [None]:
# Make another plot of your choice to visualize some aspect of the data.

### Building Intuitions

The goal of this notebook is to ask questions about the data and let it tell us the answers. You might notice that there are some predictive relationships in the data (e.g., engine size is closely related to price). In the next notebook, we'll exploit these relationships to build statistical models that learn based on observed data.