## Analysis of a Car Sell Dataset Part 2

There are two goals in this second analysis task: (1) train linear regression models to predict the selling prices of cars; (2) assess data ethics issues. There are *7* questions in this portfolio. 

The first goal involves a standard Data Science workflow: exploring data, building models, making predictions, and evaluating results. In this task, we will explore how feature selection and the size of the training/testing data affect model performance. We will use another cleaned car_sells sub-dataset that **is different from** the one in “Analysis of a Car Sell Dataset” task 1. This goal covers Questions *1-6*. 

Question *7* is about data ethics issues.

In [3]:
your_name = "John xxx/Jane xxx"
your_student_id = "XXXXXXXX"

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Q1: Import Cleaned Car Sell Dataset
The CSV file named 'car_sells_clean_data.csv' is provided. Use the pandas function `read_csv` to read it, then print its total length.

In [7]:
# Load the dataset (the CSV must be in the notebook's working directory)
file_path = 'car_sells_clean_data.csv'
df = pd.read_csv(file_path)

# Print the total number of rows
print(f"Total length of the dataset: {len(df)}")

FileNotFoundError: [Errno 2] No such file or directory: 'car_sells_clean_data.csv'

### Q2: Explore the Dataset

* Use the methods `head()` and `info()` to get a rough picture of the data, e.g., how many columns there are and the data type of each column.
* As our goal is to predict cars' selling prices given other columns, please get the correlations between year/km_driven/seller_type/fuel/owner and selling_price by using the `corr()` method.
* To get the correlations between different features, you may need to first convert the categorical features (i.e., seller_type and owner) into numerical values. To do this, you may need to import `OrdinalEncoder` from `sklearn.preprocessing` (refer to the useful examples [here](https://pbpython.com/categorical-encoding.html))
* Please provide ___necessary explanations/analysis___ of the correlations, and identify the ___most___ and ___least___ correlated features with respect to selling_price. Try to ___discuss___ how the correlation will affect the final prediction results if we use these features to train a regression model for selling_price prediction. In what follows, we will conduct experiments to verify your hypothesis.

In [9]:
#Use the methods, i.e., head() and info(), to have a rough picture about the data, e.g., how many columns, and the data types of each column.

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Display a summary of the dataset
print("\nDataset info:")
df.info()  # info() prints its summary directly and returns None

First few rows of the dataset:


NameError: name 'df' is not defined

In [11]:
# As our goal is to predict cars' selling prices given other columns, please get the correlations between year/km_driven/seller_type/fuel/owner and selling_price by using the corr() method.

# Encoding categorical columns
df_encoded = df.copy()
df_encoded['seller_type'] = df_encoded['seller_type'].astype('category').cat.codes
df_encoded['fuel'] = df_encoded['fuel'].astype('category').cat.codes
df_encoded['owner'] = df_encoded['owner'].astype('category').cat.codes

# Calculate correlations
correlations = df_encoded[['year', 'km_driven', 'seller_type', 'fuel', 'owner', 'selling_price']].corr()
correlations = correlations['selling_price'].drop('selling_price')  # Get correlations with selling_price only

# Print the correlations
print("Correlations with selling_price:")
print(correlations)


NameError: name 'df' is not defined

In [13]:
# To get the correlations between different features, you may need to first convert the categorical features (i.e., seller_type and owner) into numerical values. To do this, you may need to import OrdinalEncoder from sklearn.preprocessing
from sklearn.preprocessing import OrdinalEncoder

# Initialize the OrdinalEncoder
encoder = OrdinalEncoder()

# List the categorical columns to encode
categorical_columns = ['seller_type', 'fuel', 'owner']

# Apply the encoding to a copy so the original DataFrame is left unchanged
df_encoded = df.copy()
df_encoded[categorical_columns] = encoder.fit_transform(df[categorical_columns])

# Calculate the correlations
correlations = df_encoded[['year', 'km_driven'] + categorical_columns + ['selling_price']].corr()
correlations_with_price = correlations['selling_price'].drop('selling_price')  # Get correlations with selling_price only

# Print the correlations
print("Correlations with selling_price:")
print(correlations_with_price)


NameError: name 'df' is not defined
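
To pick out the most and least correlated features that Q2 asks about, one option is to rank the features by the absolute value of their correlation with `selling_price`. The following is a minimal sketch only: the `toy` DataFrame, its price formula, and the helper name `rank_by_abs_correlation` are assumptions made for illustration, not the real dataset or a required method.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, NOT the real car_sells data).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=n),
    "km_driven": rng.integers(1_000, 200_000, size=n),
    "fuel": rng.integers(0, 3, size=n),    # already-encoded category codes
    "owner": rng.integers(0, 4, size=n),   # already-encoded category codes
})
# Hypothetical price: rises with year, falls with km_driven, plus noise.
toy["selling_price"] = (
    20_000 * (toy["year"] - 2000)
    - 1.0 * toy["km_driven"]
    + rng.normal(0, 50_000, size=n)
)


def rank_by_abs_correlation(frame, target):
    """Return correlations with `target`, ordered by absolute value (strongest first)."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranked = rank_by_abs_correlation(toy, "selling_price")
print(ranked)

most_correlated = list(ranked.index[:2])    # two strongest features
least_correlated = list(ranked.index[-2:])  # two weakest features
print("Most correlated:", most_correlated)
print("Least correlated:", least_correlated)
```

Ranking by absolute value matters because a strongly negative correlation (such as the one built into `km_driven` here) is just as informative for a linear model as a strongly positive one.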

### Q3: Split Training and Testing Data
* Machine learning models are trained to help make predictions for the future. Normally, we need to randomly split the dataset into training and testing sets, where we use the training set to train the model, and then leverage the well-trained model to make predictions on the testing set.
* To further investigate whether the size of the training/testing data affects the model performance, please randomly split the data into training and testing sets with different sizes:
    * Case 1: training data containing 10% of the entire data;
    * Case 2: training data containing 90% of the entire data.
* Print the shape of training and testing sets in the two cases.
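
The two cases above could be produced with `train_test_split`, as in the sketch below. It assumes a small placeholder DataFrame called `toy` in place of the real data, and `random_state=42` is an arbitrary choice made only so the split is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame standing in for the cleaned dataset (an assumption).
toy = pd.DataFrame({"x": np.arange(100), "selling_price": np.arange(100) * 2.0})

# Case 1: 10% of the rows go to training, the rest to testing.
train_1, test_1 = train_test_split(toy, train_size=0.1, random_state=42)

# Case 2: 90% of the rows go to training, the rest to testing.
train_2, test_2 = train_test_split(toy, train_size=0.9, random_state=42)

print("Case 1 - train:", train_1.shape, "test:", test_1.shape)
print("Case 2 - train:", train_2.shape, "test:", test_2.shape)
```

When only `train_size` is given, `train_test_split` assigns the remaining rows to the test set, so the two shapes in each case always add up to the full dataset.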

### Q4: Train Linear Regression Models with Feature Selection under Cases 1 & 2
* When training a machine learning model for prediction, we may need to select the most important/correlated input features for more accurate results.
* To investigate whether feature selection affects the model performance, please select the two most correlated features and the two least correlated features with respect to selling_price.
* Train four linear regression models by following the conditions:
    - (model-a) using the training/testing data in case 1 with two most correlated input features
    - (model-b) using the training/testing data in case 1 with two least correlated input features
    - (model-c) using the training/testing data in case 2 with two most correlated input features
    - (model-d) using the training/testing data in case 2 with two least correlated input features
* By doing this, we can verify the impact of the size of the training/testing data on the model performance by comparing model-a and model-c (or model-b and model-d); meanwhile, the impact of feature selection can be validated by comparing model-a and model-b (or model-c and model-d).
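
One way to organise the four models is a small helper that takes a feature list and a training fraction. The sketch below is illustrative only: the synthetic `toy` data, the helper `fit_model`, and the feature lists `most_correlated` and `least_correlated` are assumptions, and the actual feature choices should come from the correlation analysis on the real data.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the real car_sells data).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=n),
    "km_driven": rng.integers(1_000, 200_000, size=n),
    "fuel": rng.integers(0, 3, size=n),
    "owner": rng.integers(0, 4, size=n),
})
toy["selling_price"] = (
    20_000 * (toy["year"] - 2000)
    - 1.0 * toy["km_driven"]
    + rng.normal(0, 50_000, size=n)
)

# Assumed feature choices for this toy data only.
most_correlated = ["year", "km_driven"]
least_correlated = ["fuel", "owner"]


def fit_model(frame, features, train_size, target="selling_price", seed=42):
    """Split the data, fit a linear regression, and return it with the test split."""
    train, test = train_test_split(frame, train_size=train_size, random_state=seed)
    model = linear_model.LinearRegression()
    model.fit(train[features], train[target])
    return model, test[features], test[target]


models = {
    "model-a": fit_model(toy, most_correlated, 0.1),   # case 1, most correlated
    "model-b": fit_model(toy, least_correlated, 0.1),  # case 1, least correlated
    "model-c": fit_model(toy, most_correlated, 0.9),   # case 2, most correlated
    "model-d": fit_model(toy, least_correlated, 0.9),  # case 2, least correlated
}

for name, (model, X_test, y_test) in models.items():
    print(name, "coefficients:", model.coef_, "test rows:", len(X_test))
```

Returning the test split alongside each model keeps the later evaluation step tied to exactly the rows that model never saw during training.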

### Q5: Evaluate Models
* Evaluate the performance of the four models with two metrics: MSE and Root MSE (RMSE)
* Print the results of the four models for both metrics
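
For reference, MSE is the mean of the squared prediction errors and RMSE is its square root, which puts the error back in the same units as selling_price. The sketch below computes both with NumPy; the helper name `mse_and_rmse` and the three-value example are hypothetical. `sklearn.metrics.mean_squared_error` should give the same MSE value.

```python
import numpy as np


def mse_and_rmse(y_true, y_pred):
    """Return (MSE, RMSE) for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = float(np.mean((y_true - y_pred) ** 2))  # mean of squared errors
    rmse = float(np.sqrt(mse))                    # back in the target's units
    return mse, rmse


# Tiny made-up example: the errors are 1, 0 and -2.
mse, rmse = mse_and_rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```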

### Q6: Visualize, Compare and Analyze the Results
* Visualize the results, and perform ___insightful analysis___ on the obtained results. For better visualization, you may need to carefully set the scale for the y-axis.
* Normally, the model trained with the most correlated features and more training data will get better results. Do you observe the same? If not, please ___explain the possible reasons___.
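
A bar chart with a deliberately chosen y-axis range is one possible way to compare the four models. The sketch below assumes a hypothetical helper `plot_metric` and uses placeholder numbers that are for illustration only; they are not results from the dataset.

```python
import matplotlib.pyplot as plt


def plot_metric(results, metric_name, zoom=0.5):
    """Draw one bar per model and narrow the y-axis around the observed values."""
    names = list(results)
    values = [results[name] for name in names]

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(names, values)

    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid a zero-height axis when all values match
    # Start the axis below the smallest bar so small differences stay visible.
    ax.set_ylim(max(0.0, low - zoom * span), high + 0.1 * span)

    ax.set_ylabel(metric_name)
    ax.set_title(f"{metric_name} by model")
    return ax


# Placeholder values for illustration only (NOT real results).
placeholder_rmse = {"model-a": 310.0, "model-b": 420.0, "model-c": 300.0, "model-d": 415.0}
ax = plot_metric(placeholder_rmse, "RMSE")
```

Because a y-axis that does not start at zero exaggerates visual differences between bars, it is worth stating the axis range in the accompanying analysis.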

### Q7: Data Science Ethics
* Please read the following examples: [Click here to read example_1.](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) [Click here to read example_2.](https://viborc.com/ethics-and-ethical-data-visualization-a-complete-guide/)

* Then view the picture: ![My Image](figure_portfolio2.png "This is my image")
* Please compose an analysis of 100-200 words that evaluates potential ethical concerns associated with the infographic, detailing the reasons behind these issues.
