In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# Introduction

For this week's notebook exercise we'll be training our fundamentals. One of my goals for the class was to ensure that everyone had seen the code for subsetting a dataframe a thousand times within the semester. Though it may seem mundane, operations like subsetting are so essential to data wrangling, and underpin so many more complex manipulations, that you can seldom get "too good" at them. The tricky part is to keep things interesting.

# Assignment

Let's pretend that you work for an automobile regulatory agency. One of the key questions you have is the relationship between city gas mileage and highway gas mileage. Generally, city gas mileage is consistently lower than highway gas mileage, given the necessity of stopping and starting at traffic lights, stop signs, crosswalks, etc. But *across* cars, city mileage should track highway mileage closely, gas efficiency being an intrinsic part of a car's design.

Run the code chunk below to get your dataset (note I've tampered with this otherwise real data set for the sake of this assignment), and answer the following questions:

1) Create a new variable called `hwylesscty` that's defined as the highway mileage minus the city mileage for each car. Highway mileage is given by the variable `hwy`, and city mileage is given by the variable `cty`. What's the $IQR$ for this variable? (Recall, $IQR$ is the 3rd quartile minus the 1st quartile.)

2) How many cars are outliers with regard to `hwylesscty`? (We'll define outliers as values that are more than $1.5 \times IQR$ above the 3rd quartile, or more than $1.5 \times IQR$ below the 1st quartile.)

3) What's the correlation between highway mileage and city mileage when considering all observations?

4) What's the correlation between highway mileage and city mileage when considering only non-outliers?

5) Calculate (but don't report) the correlation between highway mileage and city mileage for cars from 1999 and again for cars from 2008. What's the difference in the two correlation statistics?
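
As a warm-up for questions 1 and 2, here is a minimal sketch of the $IQR$ and the outlier rule on a small made-up series. The values in `gap` and the names `lower_fence`, `upper_fence`, and `is_outlier` are illustrative assumptions, not part of the assignment data.

```python
import pandas as pd

# Made-up highway-minus-city gaps; these are NOT values from the assignment data.
gap = pd.Series([4, 5, 5, 6, 7, 8, 9, 9, 10, 21])

# Quartiles and interquartile range
q1 = gap.quantile(0.25)
q3 = gap.quantile(0.75)
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A boolean mask flags each value; summing it counts the True entries.
is_outlier = (gap < lower_fence) | (gap > upper_fence)
n_outliers = int(is_outlier.sum())

print(iqr, lower_fence, upper_fence, n_outliers)
```

The same mask can be reused for subsetting: `gap[is_outlier]` selects the outliers and `gap[~is_outlier]` selects everything else.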

In [None]:
#!!!DO NOT TOUCH ANYTHING BELOW HERE!!!#
def func_datgen(pernoseq):
  np.random.seed(pernoseq)
  tempdat = pd.read_csv('https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/mpg.csv')
  tempdat = tempdat.loc[np.random.choice(np.arange(0, len(tempdat)), int((len(tempdat)/2)), replace = False)]
  targetidx = np.random.choice(np.arange(0, len(tempdat)), np.random.choice(np.arange(10, 15)), replace = False)
  tempdat = tempdat.reset_index(drop = True)
  tempdat.loc[targetidx, 'hwy'] = tempdat.loc[targetidx, 'hwy'] + np.random.uniform(6, 8, len(targetidx))
  tempdat.loc[targetidx, 'cty'] = tempdat.loc[targetidx, 'cty'] + np.random.uniform(-8, -6, len(targetidx))
  return tempdat
#!!!DO NOT TOUCH ANYTHING ABOVE HERE!!!#

In [None]:
cardat = func_datgen(104108)

In [None]:
#Q1

# Highway mileage minus city mileage for each car
cardat['hwylesscty'] = cardat['hwy'] - cardat['cty']
#print(cardat)

# 1st and 3rd quartiles, then the interquartile range
# (cardat['hwylesscty'].quantile([0.25, 0.75]) gives the same quartiles)
q1, q3 = np.percentile(cardat['hwylesscty'], [25, 75])
print(q1, q3)
iqr = q3 - q1
print(iqr)

5.0 9.0
4.0


In [None]:
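#Q2
print(cardat['hwylesscty'].min())
print(cardat['hwylesscty'].max())

# Outlier fences: 1.5*IQR below the 1st quartile and above the 3rd quartile
factor = 1.5*iqr
lower = q1 - factor
higher = q3 + factor
#print(lower)
#print(higher)

# Count the cars below the lower fence
n_low = (cardat['hwylesscty'] < lower).sum()
print(n_low)

# Count the cars above the upper fence
n_high = (cardat['hwylesscty'] > higher).sum()
print(n_high)

# Total number of outliers
print(n_low + n_high)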

2.0
25.024148295300378
0
13
13


In [None]:
#Q3
stats.pearsonr(cardat["hwy"], cardat["cty"])
#first number is the correlation; second number is the p-value

(0.6069401112844439, 4.052375683320221e-13)
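
A small sketch of how the `pearsonr` result can be unpacked, using made-up mileage values (the numbers and the names `r`, `p_value`, and `r_pandas` are assumptions for illustration). It also shows that pandas' own `Series.corr`, which computes Pearson's correlation by default, should agree with the first number.

```python
import pandas as pd
import scipy.stats as stats

# Made-up mileage values; NOT taken from the assignment data.
cty = pd.Series([18, 21, 20, 16, 15, 19, 11])
hwy = pd.Series([26, 29, 27, 23, 22, 27, 17])

# pearsonr returns the correlation first and the p-value second
r, p_value = stats.pearsonr(hwy, cty)

# Series.corr uses Pearson's method by default
r_pandas = hwy.corr(cty)

print(round(r, 3), round(r_pandas, 3))
```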

In [None]:
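#Q4
# Keep only non-outliers: cars whose gap lies within both fences
new = cardat[(cardat['hwylesscty'] >= lower) & (cardat['hwylesscty'] <= higher)]
#print(new)
stats.pearsonr(new["hwy"], new["cty"])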


(0.9471314848524123, 3.64669433774186e-52)
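
The jump in correlation once outliers are dropped can be reproduced on a toy example. The sketch below assumes a made-up dataframe `toy` in which one car has an unusually large gap, and it uses `Series.between` as one possible way to express "within both fences"; none of these values come from the assignment data.

```python
import pandas as pd

# Made-up data: the last car has an unusually large highway-minus-city gap.
toy = pd.DataFrame({
    "cty": [18, 21, 20, 16, 15, 19, 11, 9],
    "hwy": [26, 29, 27, 23, 22, 27, 17, 28],
})
gap = toy["hwy"] - toy["cty"]

q1, q3 = gap.quantile(0.25), gap.quantile(0.75)
iqr = q3 - q1

# between() is inclusive on both ends by default,
# so a value sitting exactly on a fence is kept as a non-outlier.
keep = gap.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = toy[keep]

r_all = toy["hwy"].corr(toy["cty"])
r_clean = clean["hwy"].corr(clean["cty"])

print(len(toy), len(clean))
print(round(r_all, 3), round(r_clean, 3))
```

A single extreme row is enough to pull the correlation down noticeably here, which is the same pattern questions 3 and 4 are probing.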

In [None]:
#Q5
nine = cardat[cardat["year"] == 1999]
two = cardat[cardat["year"] == 2008]
#print(nine)
print(stats.pearsonr(nine["hwy"], nine["cty"]))
print(stats.pearsonr(two["hwy"], two["cty"]))

#The answer is the 1999 correlation minus the 2008 correlation (the first number in each pair)


(0.7075842416462013, 2.598544737531751e-10)
(0.5097355591685532, 5.122598157322228e-05)
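
When there are more than a couple of groups, subsetting each one by hand gets tedious. One possible alternative, sketched here on made-up data, is to loop over `groupby` and collect one correlation per year. The dataframe `toy` and the names `r_by_year` and `diff` are assumptions for illustration.

```python
import pandas as pd

# Made-up data with two model years; NOT taken from the assignment data.
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999, 2008, 2008, 2008, 2008],
    "cty":  [18, 21, 16, 11, 20, 15, 19, 9],
    "hwy":  [26, 29, 23, 17, 27, 22, 27, 28],
})

# One Pearson correlation per year, stored in a dict keyed by year
r_by_year = {
    year: grp["hwy"].corr(grp["cty"])
    for year, grp in toy.groupby("year")
}

# Difference between the two correlation statistics (1999 minus 2008)
diff = r_by_year[1999] - r_by_year[2008]

print(r_by_year)
print(diff)
```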


In [None]:
#The code chunk below demonstrates how to export your answers into a .csv file
#Fill in each part with your answers:
  #exportobj = pd.DataFrame({'PerNoSeq': ,'Question1': , 'Question2': , 'Question3': , 'Question4': , 'Question5': , 'CollaboratorNames': })
      #Note, fill in with '' if no collaborators; if multiple, type names in one '' separated with commas
exportobj = pd.DataFrame({'PerNoSeq': [104108],'Question1': [4.0], 'Question2': [13], 'Question3': [0.607], 'Question4': [0.947], 'Question5': [0.198], 'CollaboratorNames': ['Ruby Spadone']})
#Then, export your object with the code below
  #exportobj.to_csv("EddieKimWeekW10.csv") <- change the name to your own!
exportobj.to_csv("KaileeHollisterWeekW10.csv")
    #Remember that after exporting, the file will appear in the "Files" tab (check the LHS of the screen); from there, download onto your machine, and upload it to Blackboard

Based on the personal number sequence `12345`, the answers to the above questions should be as follows:

1: 4.0

2: 10

3: 0.738

4: 0.952

5: 0.0717