# Activity 3

* Download the Activity3 lab and upload it to Google Colab.
* Answer the Activity3 questions in Canvas and populate the cells below.
* Submit the Activity3 questions on Canvas and upload the PDF version of this lab:
>* To submit this lab as a PDF, go to File, click Print, then save it as a PDF instead of printing it.

# Business Problem

Maggie Lee is an analyst for the US Environmental Protection Agency (EPA), focusing on fuel efficiency. She has a sample of 150 vehicles that she wants to analyze, but first needs to cleanse and preprocess the data. You have been hired as an analytics consultant on the project, responsible for data cleansing decisions and tasks.

A basic overview of the **VehicleEfficiency.csv** data is below:

| Variable |Description |
| ----------- | ----------- |
|VID| unique vehicle identification number|
|make| automaker brand name|
|model| vehicle model name|
|year| model year of release|
|cityMPG| average city miles per gallon|
|highwayMPG| average highway miles per gallon|
|cylinders| number of cylinders|
|displ| displacement|
|co2TailpipeGpm| measure of tailpipe emissions|
|drive| drive type (Rear-Wheel Drive, Front-Wheel Drive, 4-Wheel or All-Wheel Drive)|
|fuelType| recommended fuel type (Regular, Premium, Diesel)|
|transmission| transmission type (Manual 5-spd, Automatic 4-spd, Manual 4-spd, Automatic 3-spd)|
|VClass| vehicle class (Two Seaters, Subcompact Cars, Minicompact Cars)|

# Import Packages

In [None]:
# do not manipulate this cell - just run it

import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

set_config(transform_output = "pandas")

# Data Import

In [None]:
# do not manipulate this cell - just run it
data = pd.read_csv('https://raw.githubusercontent.com/CHill-MSU/INFO265_Data/refs/heads/main/VehicleEfficiency.csv', index_col = 0)

data.head()

| VID | make | model | year | cityMPG | highwayMPG | cylinders | displ | co2TailpipeGpm | drive | fuelType | transmission | VClass |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 1 | Alfa Romeo | Spider Veloce 2000 | 1985.0 | 19.0 | 25.0 | 4 | 2.0 | 423.190476 | Rear-Wheel Drive | Regular | Manual 5-spd | Two Seaters |
| 2 | Bertone | X1/9 | 1985.0 | 20.0 | 26.0 | 4 | 1.5 | 403.954545 | Rear-Wheel Drive | Regular | Manual 5-spd | Two Seaters |
| 3 | Chevrolet | Corvette | 1985.0 | 15.0 | 21.0 | 8 | 5.7 | 522.764706 | Rear-Wheel Drive | Regular | Automatic 4-spd | Two Seaters |
| 4 | Chevrolet | Corvette | 1985.0 | 15.0 | 20.0 | 8 | 5.7 | 522.764706 | Rear-Wheel Drive | Regular | Manual 4-spd | Two Seaters |
| 5 | Nissan | 300ZX | 1985.0 | 15.0 | 18.0 | 6 | 3.0 | 555.4375 | Rear-Wheel Drive | Regular | Automatic 4-spd | Two Seaters |


## Q1

* Run the code cell below.
* Based on the output, which of the following is not true?

In [None]:
# do not manipulate this cell - just run it

data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 150 entries, 1 to 150
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   make            150 non-null    object 
 1   model           150 non-null    object 
 2   year            149 non-null    float64
 3   cityMPG         149 non-null    float64
 4   highwayMPG      149 non-null    float64
 5   cylinders       150 non-null    int64  
 6   displ           150 non-null    float64
 7   co2TailpipeGpm  150 non-null    float64
 8   drive           149 non-null    object 
 9   fuelType        150 non-null    object 
 10  transmission    150 non-null    object 
 11  VClass          150 non-null    object 
dtypes: float64(5), int64(1), object(6)
memory usage: 15.2+ KB
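
The non-null counts above can also be read as missing-value counts. The following is a minimal sketch on a made-up toy frame (the `toy` values are illustrative and are not taken from VehicleEfficiency.csv); it shows how `isna().sum()` gives the number of gaps per column directly.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the vehicle data (values invented for illustration)
toy = pd.DataFrame({
    "year": [1985.0, np.nan, 1986.0],
    "cityMPG": [19.0, 20.0, np.nan],
    "drive": ["Rear-Wheel Drive", None, "Front-Wheel Drive"],
    "cylinders": [4, 4, 8],
})

# Count missing entries per column; a zero means the column is complete
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the names of columns that actually have gaps
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```

Running the same two lines on `data` should agree with the `info()` output: a column's missing count is the number of entries minus its non-null count.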


## Q2

* Run the first code cell below to create arrays of variable names grouped by variable type (numeric or categorical) and print them.
* Which of the following code lines can be used to convert the categorical variables to category type columns in pandas?
* Select the right answer from Canvas, paste it below, and run the cell.

In [None]:
# do not manipulate this cell - just run it
nums = data.select_dtypes(include = 'number').columns
cats = data.select_dtypes(include = 'object').columns

print(nums)
print(cats)

Index(['year', 'cityMPG', 'highwayMPG', 'cylinders', 'displ',
       'co2TailpipeGpm'],
      dtype='object')
Index(['make', 'model', 'drive', 'fuelType', 'transmission', 'VClass'], dtype='object')


In [None]:
# Copy and paste your answer from Canvas here
data[cats] = data[cats].astype('category')
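
One way to confirm that a conversion like this took effect is to inspect the dtypes and the categories afterwards. This is a small sketch on a hypothetical toy frame (`toy` and `toy_cats` are illustrative names, not part of the lab).

```python
import pandas as pd

# Toy frame with two text columns and one numeric column (values invented)
toy = pd.DataFrame({
    "make": ["Chevrolet", "Nissan", "Chevrolet"],
    "drive": ["Rear-Wheel Drive", "Rear-Wheel Drive", "Front-Wheel Drive"],
    "cylinders": [8, 6, 4],
})

# Columns to treat as categorical, listed explicitly for this sketch
toy_cats = ["make", "drive"]
toy[toy_cats] = toy[toy_cats].astype("category")

# The text columns now report 'category'; the numeric column is unchanged
print(toy.dtypes)

# Each categorical column stores its distinct labels once, sorted
print(list(toy["drive"].cat.categories))
```

In the notebook itself, `data.dtypes` would be the equivalent check after running the cell above.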


## Q3

* Run the first code cell below to output the unique categories of the VClass variable.
* Based on the output and the data dictionary (in the Business Problem section of this notebook), which of the following code lines can be used to address inconsistent data issues in the VClass variable?
* Select the right answer from Canvas, paste it below, and run the cell.

In [None]:
# do not manipulate this cell - just run it

data['VClass'].unique()

['Two Seaters', '2 Seaters', 'Minicompact Cars', 'Subcompact Cars']
Categories (4, object): ['2 Seaters', 'Minicompact Cars', 'Subcompact Cars', 'Two Seaters']

In [None]:
# Copy and paste your answer from Canvas here
data['VClass'] = data['VClass'].replace('2 Seaters', 'Two Seaters')
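
Some pandas versions emit a warning when `replace` changes the set of categories on a categorical column. As a hedged alternative, the sketch below does the same cleanup on a toy Series (`vclass` is an illustrative stand-in, not the lab's column) by replacing on plain strings and then rebuilding the categorical, so the stray label no longer appears among the categories.

```python
import pandas as pd

# Toy categorical with one label that disagrees with the data dictionary
vclass = pd.Series(
    ["Two Seaters", "2 Seaters", "Minicompact Cars", "Subcompact Cars"],
    dtype="category",
)

# Replace on plain strings, then convert back so that the categories
# are rebuilt from the cleaned values only
cleaned = (
    vclass.astype(str)
    .replace("2 Seaters", "Two Seaters")
    .astype("category")
)

print(list(cleaned.cat.categories))
```

Whether this variant is preferable to the one-line `replace` depends on the pandas version in use; both are meant to leave a single 'Two Seaters' label.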



## Q4

* Run the first code cell below to output the unique categories of the fuelType variable.
* Based on the output and the data dictionary (in the Business Problem section of this notebook), how would you describe the errors that you see in this variable?
* Select the right answer from Canvas.

In [None]:
# do not manipulate this cell - just run it

data['fuelType'].unique()

['Regular', 'Premium', 'Automatic 4-spd', 'Diesel']
Categories (4, object): ['Automatic 4-spd', 'Diesel', 'Premium', 'Regular']
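
Comparing the observed labels against the data dictionary can be done programmatically. The sketch below uses a toy Series and a hypothetical `allowed_fuel_types` list copied from the dictionary's fuelType row; it only locates values outside the allowed set and does not decide how to fix them.

```python
import pandas as pd

# Toy column containing one label that is not a fuel type (values invented)
fuel = pd.Series(["Regular", "Premium", "Automatic 4-spd", "Diesel", "Regular"])

# Allowed labels, taken from the fuelType row of the data dictionary
allowed_fuel_types = ["Regular", "Premium", "Diesel"]

# Rows whose value is not in the allowed list
unexpected = fuel[~fuel.isin(allowed_fuel_types)]
print(unexpected)
```

The index of `unexpected` identifies which rows need attention, which helps when deciding whether a value was entered in the wrong column.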

## Q5

* Which of the following code lines applies median imputation to the numeric variables in the data dataframe?
* Select the right answer from Canvas, paste it below, and run the cell.

In [None]:
# Copy and paste your answer from Canvas here

impute_nums = SimpleImputer(strategy = "median")
data[nums] = impute_nums.fit_transform(data[nums])
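
To see what median imputation does to individual cells, here is a minimal sketch on a toy frame (the values are invented). It fits a `SimpleImputer` with `strategy="median"` and compares the filled cells with the column medians pandas computes when skipping missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with one gap per column (values invented)
toy = pd.DataFrame({
    "cityMPG": [15.0, 19.0, np.nan, 20.0],
    "highwayMPG": [18.0, np.nan, 25.0, 26.0],
})

# Column medians computed by pandas, which skips NaN by default
medians_before = toy.median()

# Fit the imputer and fill the gaps; wrap the result so that column
# names and index are kept whatever output format scikit-learn returns
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(filled)
print(imputer.statistics_)  # the per-column medians learned during fit
```

Each former gap now holds its column's median, and the values that were already present are unchanged.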

## Q6

* No code needed.
* Select the right answer in Canvas.

## Q7

* No code needed.
* Select the right answer in Canvas.

## Q8

* Which of the following code lines creates a new variable named HighEmit that takes on a value of 0 for low emission vehicles and 1 for high emission vehicles?
* Select the right answer from Canvas, paste it below, and run the cell.

In [None]:
# Copy and paste your answer from Canvas here

data['HighEmit'] = data['co2TailpipeGpm'].apply(lambda x: 0 if x < 300 else 1)
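
The same flag can be built without a row-by-row lambda. The sketch below, on a toy Series with invented values, compares the `apply` form with a vectorized `np.where` form; the cutoff of 300 is taken from the cell above, and `co2`, `via_apply`, and `via_where` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy emissions values, including one exactly at the cutoff (values invented)
co2 = pd.Series([250.0, 300.0, 423.19, 299.9])
threshold = 300  # cutoff used in the notebook's lambda

# Row-by-row version, as in the cell above
via_apply = co2.apply(lambda x: 0 if x < threshold else 1)

# Vectorized version: 0 where the value is below the cutoff, 1 otherwise
via_where = pd.Series(np.where(co2 < threshold, 0, 1), index=co2.index)

print(via_apply.tolist())
print(via_where.tolist())
```

Both forms assign 1 to a value exactly equal to the cutoff, because the test is a strict less-than.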