# MPG dataset analysis and prediction

Your task is to analyze the mpg dataset using techniques learned in the course and create a multiple linear regression model to predict `mpg` (miles per gallon).

The notebook consists of two parts:

  1. Analyzing the data (summary statistics and graphical analysis).
  2. Creating a multiple linear regression model.
  
In comparison with the course, there are two new concepts:
  1. Dummy variables for categorical attributes.
  2. Mean absolute error (MAE).
  
These concepts will be explained in the appropriate sections.





### Data Description

The data we are using is a modified version of the [auto mpg](https://archive.ics.uci.edu/ml/datasets/auto+mpg) dataset, taken from the UCI repository.

Information regarding data<br>
&emsp;&emsp;&emsp;&emsp;**Title:** Auto-Mpg Data<br>
&emsp;&emsp;&emsp;&emsp;**Number of Instances:** 398<br>
&emsp;&emsp;&emsp;&emsp;**Number of Attributes:** 7 <br>
&emsp;&emsp;&emsp;&emsp;**Attribute Information:**

    1. mpg:           continuous
    2. displacement:  continuous
    3. horsepower:    continuous
    4. weight:        continuous
    5. model year:    multi-valued discrete
    6. origin:        multi-valued discrete
    
All the attributes are self-explanatory, except perhaps displacement (see the [definition](https://en.wikipedia.org/wiki/Engine_displacement)).

### Tasks

1. Load CSV data into a pandas DataFrame.
2. Explore data:
   * Specify which variables are numerical and which are categorical variables.
   * Calculate number of unique values for each variable.
   * Detect missing values (`NaN` values).
3. If there are any `NaN` values, filter them out.






### Hints

1. Because the `model_year` variable takes only a small number of distinct values, it can be analyzed as a categorical variable.
2. A useful method for the description of data types is `.info()`.
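
The exploration steps above can be sketched on a small stand-in frame. This is only an illustration: the `toy` DataFrame and its values are made up and are not taken from the real dataset, and the same calls would be applied to `df` once it is loaded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 31.0],
    "horsepower": [130.0, np.nan, 95.0, 65.0],
    "model_year": [70, 70, 76, 82],
    "origin": ["usa", "usa", "japan", "europe"],
})

# Column dtypes: object columns are candidates for categorical variables
print(toy.dtypes)

# Number of distinct values per column (NaN is not counted)
unique_counts = toy.nunique()
print(unique_counts)

# Number of missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop every row that contains at least one NaN
toy_clean = toy.dropna().reset_index(drop=True)
print(toy_clean.shape)
```

A column with few distinct values relative to the number of rows, such as `model_year` here, is a natural candidate for treatment as a categorical variable.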


In [None]:
# import all necessary libraries
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

We will first import the data into a pandas DataFrame and inspect its properties.

In [None]:
# Load the dataset from the URL below
df = pd.read_csv("https://gist.githubusercontent.com/Ruzejjur/7c3507e8e99a1013658db5f5eace3d33/raw/a86c1ac1e377fa9609109816df8942f13d2f327c/gistfile1.txt")

# Display the first few rows of the dataset
df.head()


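| Unnamed: 0 | mpg | displacement | horsepower | weight | model_year | origin |
|---|---|---|---|---|---|---|
| 0 | 18.0 | 307.0 | 130.0 | 3504 | 70 | usa |
| 1 | 15.0 | 350.0 | 165.0 | 3693 | 70 | usa |
| 2 | 18.0 | 318.0 | 150.0 | 3436 | 70 | usa |
| 3 | 16.0 | 304.0 | 150.0 | 3433 | 70 | usa |
| 4 | 17.0 | 302.0 | 140.0 | 3449 | 70 | usa |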


# Analysis of variables

Analyze the categorical and numerical variables separately.

## Analysis of categorical variables


### Tasks

1. Subset the dataset only for the categorical variables.
2. Explore unique values of these variables.
3. Calculate summary statistics for those categorical variables which are numeric:
   * min
   * max
   * mean
   * median
   * variance
   * standard deviation
4. Create graphical analysis of the categorical variables:
   * Create a count plot for each categorical variable (use `sns.countplot()`).
5. Describe the insights from the analysis:
   * Is the dataset balanced with respect to individual categories?


### Hints

1. A useful method for the summary statistics is `.describe()` from the pandas package.
2. We would like the data to be evenly represented across categories (i.e., uniformly distributed). Are the data uniformly distributed across the categories?
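
One possible way to put numbers on the balance question is sketched below. The `toy` frame and the `imbalance_ratio` indicator are assumptions made for illustration; they are not part of the course material, and the values are invented.

```python
import pandas as pd

# Toy stand-in with two categorical variables (values are invented)
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71, 72, 72, 72, 72],
    "origin": ["usa", "usa", "usa", "usa", "usa", "japan", "japan", "europe"],
})

# Share of observations in each category of origin
origin_share = toy["origin"].value_counts(normalize=True)
print(origin_share)

# Simple (hypothetical) imbalance indicator: largest share / smallest share.
# A value of 1 would mean perfectly uniform categories.
imbalance_ratio = origin_share.max() / origin_share.min()
print(imbalance_ratio)

# Summary statistics for a categorical variable that is stored as numbers
year_stats = toy["model_year"].agg(["min", "max", "mean", "median", "var", "std"])
print(year_stats)
```

The same counts can be shown as bars with `sns.countplot(data=toy, x="origin")`; bars of very different heights correspond to a ratio far above 1.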


## Analysis of numerical variables

### Tasks

1. Subset the dataset only for the numerical variables.
2. Calculate summary statistics for these variables:
   * min
   * max
   * mean
   * median
   * variance
   * standard deviation
3. Create graphical analysis of the numeric variables:
   * Create one (or more) of the following plots for each numeric variable:
      * Histogram (`sns.histplot()`)
      * Box plot (`sns.boxplot()`)
      * Violin plot (`sns.violinplot()`)
4. Analyze the relationships between individual numerical variables:
   * Use `sns.pairplot()`.
5. Describe the insights from the analysis:
   * Describe the distribution of individual numeric variables.
   * Are there any linear relationships between the numeric variables?
     * For example: If the weight increases, does the horsepower increase/decrease?


### Hints

1. A useful method for the summary statistics is `.describe()` from the pandas package.
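
As a numeric companion to the plots, the sketch below computes the requested statistics in one call and uses a correlation matrix to check for linear relationships. The `toy` frame holds invented values, and using `.corr()` here is a suggestion rather than a required step.

```python
import pandas as pd

# Toy stand-in with numerical variables only (values are invented)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 31.0, 36.0],
    "horsepower": [130.0, 165.0, 95.0, 65.0, 58.0],
    "weight": [3504, 3693, 2372, 1773, 1825],
})

# One row per statistic, one column per variable
summary = toy.agg(["min", "max", "mean", "median", "var", "std"])
print(summary)

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship
corr = toy.corr()
print(corr.round(2))
```

A positive entry such as the one for `horsepower` and `weight` corresponds to an upward-sloping cloud in the matching `sns.pairplot()` panel, and a negative entry to a downward-sloping one.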


## Analysis of categorical vs. numerical variables

### Tasks

1. Work with the original dataset (containing both categorical and numeric variables).
2. Explore the relationship between categorical variables and numeric variables:
   * Create a boxen plot for each pair of a categorical and a numerical variable (use `sns.boxenplot()`).
3. (Optional) Create a violin plot for each pair of a categorical and a numerical variable (use `sns.violinplot()`).
4. Create a line plot of `model_year` vs. individual numerical variables.
5. Describe what you found from the plots in tasks 2, 3, and 4.
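
The quantities that these plots visualize can also be computed directly, which may help when describing the findings. The sketch below uses an invented `toy` frame; the grouping columns mirror the dataset's `origin` and `model_year`.

```python
import pandas as pd

# Toy stand-in with both categorical and numerical variables (values are invented)
toy = pd.DataFrame({
    "mpg": [18.0, 16.0, 27.0, 25.0, 30.0, 34.0],
    "model_year": [70, 70, 76, 76, 82, 82],
    "origin": ["usa", "usa", "europe", "usa", "japan", "japan"],
})

# Median mpg per origin: the value marked by the central line of a box or boxen plot
median_by_origin = toy.groupby("origin")["mpg"].median()
print(median_by_origin)

# Mean mpg per model year: the trend a line plot over model_year would follow
mean_by_year = toy.groupby("model_year")["mpg"].mean()
print(mean_by_year)
```

By default, `sns.lineplot()` aggregates repeated `x` values with the mean, so a line plot of `model_year` against `mpg` traces values like those in `mean_by_year`.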


# Prediction model

### Tasks

Create a multiple linear regression model that predicts `mpg` from all other variables in the dataset.

1. Separate the dependent and independent variables:
   * Name the independent variables dataframe `X`.
   * Name the dependent variable `Y`.
2. Convert any categorical variables into dummy variables, i.e. 0/1 indicator columns with one column per category except for a baseline category. You can copy the following code into the code block:

```python
# Convert the categorical variables into dummy variables

# Names of the columns to be converted to dummy variables (fill in the list)
cat_columns_to_be_converted = []

# Modify the X dataframe of independent variables
X = pd.get_dummies(X, columns=cat_columns_to_be_converted, drop_first=True)

# Convert boolean dummy columns to int, as sm.OLS() requires numerical variables
for col in X.columns:
    # Check whether the column type is boolean
    if X[col].dtype == "bool":
        # If so, convert True/False to 1/0
        X[col] = X[col].astype(int)
```

3. Analyze the results:
   * What is the R²? Is it high or low?
   * Are all of the variables statistically significant?
4. Create the residual plot:
   * Are the residuals randomly distributed around 0? Are there any patterns in the residual plot?


## Testing model accuracy

## DO NOT MODIFY THIS CODE!!!

Just uncomment the code and run it after you are finished with the previous sections.


In this section, the model you trained above is going to be evaluated using repeated random Train-Test dataset splitting.

The dataset is going to be split into two parts:
1. *Train dataset* - This is the dataset on which the model is going to be trained.
2. *Test dataset* - This is the set of data which the model 'has not seen yet' and it is used to test how well the model predicts `mpg` with new data.

The splitting is done 1000 times, randomly assigning data to Train and Test datasets. The split is as follows:
1. 80% is assigned to the Train dataset.
2. 20% is assigned to the Test dataset.

On each iteration, Mean Absolute Error (**MAE**) is calculated.

**MAE** is defined as the average absolute difference between the actual values and the predicted values.

Mathematically, it is given by:

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

An MAE of 0 means that the model predicts every value exactly. In practice this is almost never achieved, so the goal is to make the MAE as low as possible.
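
A small worked example of the formula, using four invented actual and predicted values:

```python
import numpy as np

# Invented actual and predicted mpg values
y_true = np.array([18.0, 15.0, 24.0, 31.0])
y_pred = np.array([20.0, 14.0, 24.0, 27.0])

# Absolute errors |y_i - y_hat_i|: 2, 1, 0, 4
abs_errors = np.abs(y_true - y_pred)

# MAE = (2 + 1 + 0 + 4) / 4 = 1.75
mae = abs_errors.mean()
print(mae)
```

Because MAE is expressed in the units of the target, a value of 1.75 here means the predictions are off by 1.75 miles per gallon on average.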


### Tasks

Run the code below and describe the results.

How good is your model at predicting `mpg` from the other variables, i.e. how precise are its predictions?


### Hints

It is crucial to name the independent variables dataset as `X` and the dependent variable as `Y` for the code below to function.


In [None]:
# import statistics
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import mean_absolute_error
# import statsmodels.api as sm

# model_accuracies = []

# for repetition in range(1000):
#     # Split the dataset into training and testing sets
#     (training_inputs, testing_inputs, training_values, testing_values) = train_test_split(X, Y, test_size=0.2)

#     # Add a constant to the training inputs
#     training_inputs = sm.add_constant(training_inputs)
#     testing_inputs = sm.add_constant(testing_inputs)

#     # Train the model using the training set with sm.OLS
#     model = sm.OLS(training_values, training_inputs).fit()

#     # Predict the values on the testing set
#     predictions = model.predict(testing_inputs)

#     # Calculate the mean absolute error on the testing set
#     mae = mean_absolute_error(testing_values, predictions)
#     model_accuracies.append(mae)

# # Print the results
# print("Min. model MAE", min(model_accuracies))
# print("Max. model MAE", max(model_accuracies))
# print("Mean model MAE", statistics.mean(model_accuracies))

# print("Median model MAE", statistics.median(model_accuracies))

# # Plot the model accuracies
# x = np.array(range(1000))
# plt.plot(x, model_accuracies)
# plt.xlabel('Repetition')
# plt.ylabel('Mean Absolute Error (MAE)')
# plt.title('Model MAE over 1000 Repetitions')
# plt.show()
