Joshua Curtis

### General instructions: 
This is a group project. The names of the group members should appear clearly at the top of the notebook. Give variables and functions meaningful names (as much as possible). Comment your code in a way that explains what is done (especially inside functions).

### 1 Preparation
The data for this project is on my Github page. In the course repository there
is a file named cardata2005 (both in json and csv formats) which includes
information on prices, quantities and features of cars sold in the US to private
people. There is also a file named census historical household data.xls which
contains the number of households in the US in previous years. When you
are done, before submitting the notebook, re-run the whole notebook.

1. At the beginning of your notebook, have a cell in which you import all the packages that will be used in the notebook. Each import should carry a comment explaining what the package (or group of packages) is used for in your notebook.


In [1]:
# Packages used

# To download and parse the data:
import requests, json 

#for plotting:
import matplotlib.pyplot as plt 

# Dataframes and math:
import numpy as np
import pandas as pd
import math

# K-means clustering
from sklearn.cluster import KMeans

# For linear regression
import statsmodels.formula.api as smf

2. Import the data from Github and store it in a dataframe. Report how many cars are in the dataframe and how many columns it has.

In [2]:
url = "https://raw.githubusercontent.com/ArieBeresteanu/Econ-1923/main/demand_estimation/cardata2005.json"
res = requests.get(url).json()
cars = pd.DataFrame(res)
cars.to_excel("RawDemandEstimationData.xlsx", sheet_name='RawData')

In [3]:
n_cars, n_cols = cars.shape

# Curly braces (not square brackets) are needed for n_cols to be interpolated
print(f"The data contains information on {n_cars} cars and it has {n_cols} columns.")
#cars.info()

The data contains information on 2199 cars and it has [n_cols] columns.


In [4]:
# Remove blank rows from bottom of df
cars = cars.drop(cars.index[217:])
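
The hard-coded cut at position 217 works only as long as the file never changes. A minimal sketch of a content-based alternative follows, assuming the blank rows hold empty strings or missing values in every column. The toy frame and its three columns are illustrative stand-ins, not the real download.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw download: two real rows, two blank rows.
raw = pd.DataFrame({
    "model": ["MDX", "X3", "", None],
    "Quantity": ["57948", "30769", "", None],
    "Price": ["36970", "30995", "", None],
})

# Treat empty or whitespace-only strings as missing, then drop rows
# in which every column is missing.
cleaned = raw.replace(r"^\s*$", np.nan, regex=True).dropna(how="all")

# Convert the numeric columns only after the blank rows are gone.
cleaned = cleaned.astype({"Quantity": float, "Price": float})

print(cleaned)
```

If the real blank rows contain something else (for example a stray index value), the mask would need to be restricted to the data columns.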

3. Create a new column in your dataframe which contains the name of the category to which each car belongs.

In [5]:
typeFix = ['Quantity','Price','wheel_base','length','width','mpg_city','mpg_highway','hp', 'disp','weight']
# Convert the typeFix columns from object to float so that nlargest() and later numeric operations work.
cars[typeFix] = cars[typeFix].astype(float)
# Convert hybrid to type int
cars['hybrid'] = cars['hybrid'].astype(int)
#cars[typeFix] = pd.to_numeric(cars[typeFix],errors ='coerce')
#cars.info()

In [6]:
cars['category'] = cars['segm1'].map(lambda x: math.floor((x)/10))

# using a dictionary

categoryDict = {
    '0': 'passenger cars',
    '2': 'minivans',
    '3': 'SUV',
    '4': 'light trucks'   
}
cars['categoryName'] = cars['category'].map(lambda x: categoryDict[str(x)])

carCat = pd.crosstab(index=cars['categoryName'], columns='count')

cars
# Create new variables:
# new features: footprint etc.
# market shares: get number of households
# categorical variables (grouping)
# IVs (based on features / ---- source)
# -- list of features
# -- feature averages per category
# -- dist2cat
# -- dist2cat/2


Unnamed: 0,car,year,firm_id,firm_name,division,model,hybrid,segm1,Quantity,Price,...,width,weight,disp,hp,mpg_city,mpg_highway,Unnamed: 18,__1,category,categoryName
0,0,2005,3,HONDA,Acura,MDX,0,39,57948.0,36970.0,...,77.0,4451.0,3.5,265.0,17.0,23.0,,,3,SUV
1,0,2005,8,BMW,BMW,X3,0,39,30769.0,30995.0,...,73.0,4001.0,2.5,184.0,17.0,24.0,,,3,SUV
2,0,2005,8,BMW,BMW,X5,0,39,37598.0,42395.0,...,73.7,4652.0,3.0,225.0,15.0,21.0,,,3,SUV
3,0,2005,19,GM,Buick,Rainier,0,34,15271.0,35765.0,...,75.4,4442.0,4.2,275.0,16.0,21.0,,,3,SUV
4,0,2005,19,GM,Buick,Rendezvous,0,38,60589.0,27270.0,...,73.6,4024.0,3.4,185.0,19.0,26.0,,,3,SUV
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212,1,2005,7,VOLKS,volkswagen,passat,0,3,49233.0,24955.0,...,68.7,3422.0,1.8,170.0,21.0,30.0,,,0,passenger cars
213,1,2005,6,VOLVO,volvo,S40,0,3,24241.0,23945.0,...,69.7,3084.0,2.4,168.0,20.0,27.0,,,0,passenger cars
214,1,2005,6,VOLVO,volvo,s60,0,3,24695.0,27920.0,...,71.0,3662.0,2.4,168.0,19.0,26.0,,,0,passenger cars
215,1,2005,6,VOLVO,volvo,v70 c70 s70,0,4,22823.0,29445.0,...,71.0,3448.0,2.4,168.0,19.0,26.0,,,0,passenger cars


4. Report descriptive statistics for the features of the cars in the data as well as prices and quantities. I leave it to you to decide how and what. The general rule is that descriptive statistics, as the name suggests, are meant to describe the data in some informative way. Comment on your findings.

In [7]:
# Count number of models by category name 
cars.groupby(['categoryName']).size()

categoryName
SUV                71
light trucks       14
minivans           16
passenger cars    116
dtype: int64

In [8]:
# Get the total quantity sold of each category 
cars.groupby(['categoryName'])['Quantity'].sum()

categoryName
SUV               4419393.0
light trucks      3094809.0
minivans          1132949.0
passenger cars    7378033.0
Name: Quantity, dtype: float64

In [9]:
# Combined mpg
cars['mpg_combined'] = cars['mpg_city']*0.55+cars['mpg_highway']*0.45

# Footprint in thousands of square inches
cars['footprint'] = cars['width'] * cars['length'] / 1000  # rescaling

In [10]:
discChars = ['Quantity','Price','wheel_base','length','width','mpg_city','mpg_highway','hp', 'disp','weight']
print(cars.groupby(['categoryName'])[discChars].mean())

                     Quantity         Price  wheel_base      length  \
categoryName                                                          
SUV              62244.971831  32046.380282  111.636620  189.300000   
light trucks    221057.785714  20363.142857  122.257143  205.364286   
minivans         70809.312500  24172.812500  126.531250  201.681250   
passenger cars   63603.732759  28973.931034  106.210345  183.897414   

                    width   mpg_city  mpg_highway          hp      disp  \
categoryName                                                              
SUV             74.469014  17.126761    22.014085  229.126761  3.714085   
light trucks    73.792857  18.285714    23.428571  218.142857  3.835714   
minivans        76.537500  17.312500    23.000000  201.500000  3.706250   
passenger cars  71.014655  22.525862    29.206897  192.387931  2.830172   

                     weight  
categoryName                 
SUV             4313.521127  
light trucks    4128.500000  
mi

In [11]:
cars.groupby(['categoryName'])[discChars].std()

categoryName,Quantity,Price,wheel_base,length,width,mpg_city,mpg_highway,hp,disp,weight
SUV,56580.169919,12208.051861,8.700579,14.640921,3.692099,3.593365,4.148986,56.038999,1.1285,918.357087
light trucks,268992.032165,7390.979782,11.365313,15.438743,4.783861,3.770912,4.182643,68.067678,1.288772,849.263369
minivans,68802.863987,2377.987222,24.071746,10.885171,2.604835,1.922455,3.119829,25.250743,0.593822,410.646157
passenger cars,74602.601515,15670.670565,6.399013,13.316426,3.433372,6.780357,5.089661,63.624137,0.937176,512.998368


In [12]:
cars.groupby(['categoryName'])[discChars].min()

categoryName,Quantity,Price,wheel_base,length,width,mpg_city,mpg_highway,hp,disp,weight
SUV,1334.0,14195.0,93.4,150.2,66.5,12.0,14.0,108.0,1.5,2269.0
light trucks,5872.0,13980.0,109.4,187.5,66.2,12.0,16.0,143.0,2.3,3010.0
minivans,3436.0,18995.0,111.2,189.3,72.0,14.0,18.0,150.0,2.4,3772.0
passenger cars,666.0,10390.0,89.2,143.1,65.7,16.0,23.0,67.0,1.0,1850.0


In [13]:
cars.groupby(['categoryName'])[discChars].max()

categoryName,Quantity,Price,wheel_base,length,width,mpg_city,mpg_highway,hp,disp,weight
SUV,244150.0,78420.0,137.1,226.4,84.3,30.0,34.0,345.0,6.0,6680.0
light trucks,901463.0,43055.0,140.5,229.7,79.6,24.0,29.0,345.0,5.7,5648.0
minivans,180759.0,29695.0,211.0,224.1,79.4,20.0,27.0,255.0,4.6,5258.0
passenger cars,431703.0,90620.0,121.5,216.2,83.0,61.0,56.0,400.0,6.0,4399.0
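
The four separate `groupby` calls above (mean, std, min, max) can also be collected into one table. The following is a small sketch on a toy frame; the values are made up for illustration and only the column names mirror the notebook.

```python
import pandas as pd

# Toy data with made-up values; column names follow the notebook.
toy = pd.DataFrame({
    "categoryName": ["SUV", "SUV", "minivans", "minivans"],
    "Price": [30000.0, 40000.0, 20000.0, 26000.0],
    "hp": [200.0, 260.0, 180.0, 220.0],
})

# One pass produces a column MultiIndex of (feature, statistic).
summary = toy.groupby("categoryName")[["Price", "hp"]].agg(["mean", "std", "min", "max"])

print(summary.round(1))
```

A single statistic can then be read with a tuple key, for example `summary.loc["SUV", ("Price", "mean")]`.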


In [14]:
# Count number of models by firm name 
cars.groupby(['firm_name']).size()

firm_name
BMW          6
CHRYSLER    16
FORD        25
GM          46
HONDA       14
HYUNDAI      7
ISUZU        1
JAGUAR       4
KIA          6
LAND         2
MAZDA        6
MERCEDEZ     8
MINI         1
MITSUBIS     6
NISSAN      15
PORSCHE      3
SAAB         3
SUBARU       4
SUZUKI       4
TOYOTA      25
VOLKS       10
VOLVO        5
dtype: int64

In [15]:
# Get the total quantity sold of each firm 
cars.groupby(['firm_name'])['Quantity'].sum()

firm_name
BMW          256249.0
CHRYSLER    2019001.0
FORD        2914332.0
GM          4024172.0
HONDA       1411886.0
HYUNDAI      455012.0
ISUZU          7585.0
JAGUAR        30424.0
KIA          246842.0
LAND          21487.0
MAZDA        238903.0
MERCEDEZ     202955.0
MINI          40820.0
MITSUBIS     118638.0
NISSAN      1051466.0
PORSCHE       30449.0
SAAB          36071.0
SUBARU       181205.0
SUZUKI        66396.0
TOYOTA      2252323.0
VOLKS        301052.0
VOLVO        117916.0
Name: Quantity, dtype: float64

In [16]:
cars.groupby(['firm_name'])[discChars].mean()

firm_name,Quantity,Price,wheel_base,length,width,mpg_city,mpg_highway,hp,disp,weight
BMW,42708.166667,41828.333333,109.666667,181.516667,72.15,18.333333,26.0,214.333333,2.9,3766.166667
CHRYSLER,126187.5625,23457.125,112.5625,189.36875,73.96875,19.5,24.5,192.5625,3.0875,3809.5
FORD,116573.28,27707.2,115.404,198.952,74.332,18.08,23.64,217.72,3.884,4143.6
GM,87482.0,28484.021739,116.3,196.195652,73.956522,18.521739,24.717391,220.652174,3.895652,4015.369565
HONDA,100849.0,25946.071429,105.25,181.914286,71.192857,27.214286,32.142857,193.5,2.557143,3333.428571
HYUNDAI,65001.714286,17594.714286,102.8,177.728571,69.985714,23.0,29.285714,146.0,2.314286,3068.0
ISUZU,7585.0,29254.0,129.0,207.6,76.1,15.0,19.0,275.0,4.2,4790.0
JAGUAR,7606.0,51970.0,110.65,191.325,81.15,18.0,26.0,253.75,3.475,3715.0
KIA,41140.333333,17730.833333,105.916667,183.166667,71.066667,20.0,27.0,161.166667,2.75,3553.333333
LAND,10743.5,36245.0,107.35,182.5,78.45,15.5,19.5,237.0,3.45,4533.0


In [17]:
cars.groupby(['firm_name'])[discChars].max()

firm_name,Quantity,Price,wheel_base,length,width,mpg_city,mpg_highway,hp,disp,weight
BMW,106950.0,70595.0,117.7,198.0,74.9,21.0,29.0,325.0,4.4,4652.0
CHRYSLER,400543.0,34620.0,140.5,229.7,84.3,29.0,32.0,345.0,5.7,5453.0
FORD,901463.0,50435.0,138.0,226.6,79.9,26.0,32.0,302.0,5.4,6680.0
GM,705980.0,53895.0,211.0,224.1,81.2,30.0,35.0,400.0,6.0,6400.0
HONDA,352467.0,49470.0,118.1,201.2,77.3,61.0,56.0,300.0,3.5,4451.0
HYUNDAI,130365.0,24994.0,108.3,191.9,72.7,29.0,35.0,194.0,3.5,3651.0
ISUZU,7585.0,29254.0,129.0,207.6,76.1,15.0,19.0,275.0,4.2,4790.0
JAGUAR,10941.0,70495.0,119.4,200.4,83.0,18.0,26.0,294.0,4.2,3806.0
KIA,56088.0,25790.0,114.6,196.0,74.6,25.0,34.0,200.0,3.5,4802.0
LAND,19346.0,44995.0,113.6,190.9,81.5,17.0,21.0,300.0,4.4,5426.0


In [18]:
cars.groupby(['firm_name'])[discChars].min()

firm_name,Quantity,Price,wheel_base,length,width,mpg_city,mpg_highway,hp,disp,weight
BMW,10045.0,29995.0,98.2,161.1,68.5,15.0,21.0,184.0,2.5,2932.0
CHRYSLER,14665.0,14160.0,93.4,150.2,67.1,12.0,16.0,132.0,2.0,2581.0
FORD,8166.0,14235.0,103.0,174.3,66.2,12.0,15.0,136.0,2.0,2697.0
GM,3436.0,11995.0,100.5,171.9,67.2,12.0,15.0,130.0,1.8,2692.0
HONDA,666.0,14375.0,94.5,155.1,66.7,17.0,22.0,67.0,1.0,1850.0
HYUNDAI,17645.0,10544.0,96.1,166.7,65.7,18.0,25.0,104.0,1.6,2280.0
ISUZU,7585.0,29254.0,129.0,207.6,76.1,15.0,19.0,275.0,4.2,4790.0
JAGUAR,2282.0,30995.0,102.0,183.9,78.8,18.0,26.0,192.0,2.5,3498.0
KIA,18668.0,10390.0,94.9,166.9,65.9,16.0,19.0,104.0,1.6,2403.0
LAND,2141.0,27495.0,101.1,174.1,75.4,14.0,18.0,174.0,2.5,3640.0


In [19]:
cars.groupby(['firm_name'])[discChars].std()

firm_name,Quantity,Price,wheel_base,length,width,mpg_city,mpg_highway,hp,disp,weight
BMW,34828.715161,15046.284148,6.631038,12.724373,2.388095,2.160247,3.03315,56.641563,0.761577,682.47884
CHRYSLER,90435.593818,5159.983358,12.527669,20.432383,4.973358,4.082483,4.016632,55.953515,0.91351,744.444312
FORD,177447.573102,8919.37744,9.722932,14.16916,3.731255,3.639139,4.4053,51.383785,1.068363,938.794617
GM,115303.87398,10732.886462,16.555241,12.858373,4.07778,3.981843,5.230947,61.550058,1.068739,875.556746
HONDA,107674.381737,9362.187692,5.304244,12.008321,3.611193,12.861263,9.49378,74.323358,0.846453,787.237596
HYUNDAI,44200.118042,4836.993237,4.045986,8.986418,2.505613,4.0,3.545621,28.495614,0.628301,489.629111
ISUZU,,,,,,,,,,
JAGUAR,3724.740975,17549.050307,7.784814,7.102758,1.755942,0.0,0.0,49.681485,0.861684,145.269864
KIA,14316.985558,5546.85534,6.716969,11.071525,3.317027,4.147288,5.932959,39.861845,0.859651,918.029774
LAND,12165.77217,12374.368671,8.838835,11.879394,4.313351,2.12132,2.12132,89.095454,1.343503,1262.892711


In [20]:
# Count number of models by division
cars.groupby(['division']).size()

division
Acura          1
BMW            2
Buick          2
Cadillac       3
Chevrolet     12
              ..
subaru         1
suzuki         1
toyota         1
volkswagen     5
volvo          4
Length: 61, dtype: int64

In [21]:
cars.groupby(['division'])[discChars].mean()

division,Quantity,Price,wheel_base,length,width,mpg_city,mpg_highway,hp,disp,weight
Acura,57948.000000,36970.000000,106.300000,188.700000,77.000000,17.000000,23.000000,265.000000,3.500000,4451.000000
BMW,34183.500000,36695.000000,110.550000,181.700000,73.350000,16.000000,22.500000,204.500000,2.750000,4326.500000
Buick,37930.000000,31517.500000,112.600000,189.950000,74.500000,17.500000,23.500000,230.000000,3.800000,4233.000000
Cadillac,24714.333333,48881.666667,120.666667,204.966667,77.000000,14.333333,19.333333,298.333333,4.966667,5160.666667
Chevrolet,139867.083333,27732.083333,125.458333,199.800000,75.416667,16.333333,21.583333,229.583333,4.308333,4325.250000
...,...,...,...,...,...,...,...,...,...,...
subaru,53541.000000,21870.000000,99.400000,175.200000,68.300000,23.000000,30.000000,165.000000,2.500000,3090.000000
suzuki,7967.000000,13994.000000,97.600000,171.300000,67.700000,25.000000,31.000000,155.000000,2.300000,2661.000000
toyota,107897.000000,21415.000000,106.300000,175.000000,67.900000,60.000000,51.000000,76.000000,1.500000,2890.000000
volkswagen,44675.000000,22903.000000,103.060000,174.140000,69.820000,21.800000,28.200000,151.000000,2.200000,3391.200000


5. How many hybrid cars are included in the data set? Which are these cars and to which category they belong? Comment on your findings.

In [22]:
print(cars.groupby(['hybrid']).size())
print(cars.groupby(['categoryName', 'hybrid']).size())

hybrid
0    213
1      4
dtype: int64
categoryName    hybrid
SUV             0          71
light trucks    0          14
minivans        0          16
passenger cars  0         112
                1           4
dtype: int64


There are only 4 hybrid cars in the 2005 data set. All of them belong to the passenger cars category.

6. What were the top 3 and bottom 3 selling car models in the US in 2005?

In [23]:
# Get the top 3 selling cars
print(cars[['firm_name','division','model','Quantity']].nlargest(3,'Quantity'))

    firm_name   division          model  Quantity
33       FORD       Ford       F series  901463.0
14         GM  Chevrolet  Silverado C/K  705980.0
204    TOYOTA     TOYOTA          camry  431703.0


In [24]:
# Get the bottom 3 selling cars
print(cars[['firm_name','division','model','Quantity']].nsmallest(3,'Quantity'))

    firm_name  division    model  Quantity
140     HONDA     honda  insight     666.0
147    NISSAN  infiniti  q45 m45    1129.0
71   MERCEDEZ  MERCEDEZ  G class    1334.0


7. Declare a variable that contains the number of households in the US in 2005 (taken from the excel file on my Github page).

In [25]:
Households = 113343000
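
The household count serves as the market size when quantities are turned into market shares. Below is a minimal sketch of that step, assuming each household buys at most one car. The three quantities and the column names `share` and `logShareDiff` are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd

households = 113_343_000  # market size: US households in 2005, as declared above

# Hypothetical quantities for three products.
toy = pd.DataFrame({
    "model": ["A", "B", "C"],
    "Quantity": [900_000.0, 700_000.0, 400_000.0],
})

# Inside shares: units sold divided by the number of potential buyers.
toy["share"] = toy["Quantity"] / households

# Outside share: households that bought none of the listed products.
outside_share = 1.0 - toy["share"].sum()

# Dependent variable of the logit demand regression.
toy["logShareDiff"] = np.log(toy["share"]) - np.log(outside_share)

print(toy)
print(f"outside share: {outside_share:.4f}")
```

With the full data set the outside share is computed from the sum over all models, not just a few rows as in this toy example.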

8. Define variables like footprint and combined miles per gallon (and any other variables that you might need).

### 2 First Stage Regression
1. Generate instrumental variables which are based on the distance of a product from the set of products it competes with. Consider generating instrumental variables which are based on features that you intend to use in the second stage as well as on features that you will not use in the second stage.

In [26]:
# Build distance-based instruments: squared distance of each car's features from its category average
characteristics = ['mpg_combined','footprint', 'hp', 'disp', 'weight']

featuresAvg = cars.groupby(['categoryName'])[characteristics].mean()

cars['categoryCount'] = cars['categoryName'].map(lambda x: carCat.loc[x,'count'])

def dist2Cat(characteristics):
    #characteristics is a list of strings. Each string in the list is a name of a characteristic
    for ch in characteristics:
        # 1. expand
        cars[ch+'Avg'] = cars['categoryName'].map(lambda x: featuresAvg[ch][x])
        # 2. difference
        cars[ch+'Dist'] = cars[ch]-cars[ch+'Avg']
        # 3. square
        cars[ch+'Dist'] = cars[ch+'Dist'].map(lambda x: x*x)

dist2Cat(characteristics)
cars

Unnamed: 0,car,year,firm_id,firm_name,division,model,hybrid,segm1,Quantity,Price,...,mpg_combinedAvg,mpg_combinedDist,footprintAvg,footprintDist,hpAvg,hpDist,dispAvg,dispDist,weightAvg,weightDist
0,0,2005,3,HONDA,Acura,MDX,0,39,57948.0,36970.0,...,19.326056,0.139834,14.134972,0.155968,229.126761,1286.889308,3.714085,0.045832,4313.521127,18900.440587
1,0,2005,8,BMW,BMW,X3,0,39,30769.0,30995.0,...,19.326056,0.678883,14.134972,1.034029,229.126761,2036.424519,3.714085,1.474001,4313.521127,97669.454672
2,0,2005,8,BMW,BMW,X5,0,39,37598.0,42395.0,...,19.326056,2.644059,14.134972,0.355552,229.126761,17.030153,3.714085,0.509917,4313.521127,114567.947629
3,0,2005,19,GM,Buick,Rainier,0,34,15271.0,35765.0,...,19.326056,1.157897,14.134972,0.200156,229.126761,2104.354096,3.714085,0.236114,4313.521127,16506.820869
4,0,2005,19,GM,Buick,Rendezvous,0,38,60589.0,27270.0,...,19.326056,7.974658,14.134972,0.166931,229.126761,1947.170998,3.714085,0.098649,4313.521127,83822.482841
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212,1,2005,7,VOLKS,volkswagen,passat,0,3,49233.0,24955.0,...,25.532328,0.232640,13.089287,0.133990,192.387931,501.219456,2.830172,1.061255,3261.431034,25782.392687
213,1,2005,6,VOLVO,volvo,S40,0,3,24241.0,23945.0,...,25.532328,5.675485,13.089287,0.687335,192.387931,594.771180,2.830172,0.185048,3261.431034,31481.771998
214,1,2005,6,VOLVO,volvo,s60,0,3,24695.0,27920.0,...,25.532328,11.440140,13.089287,0.087076,192.387931,594.771180,2.830172,0.185048,3261.431034,160455.496136
215,1,2005,6,VOLVO,volvo,v70 c70 s70,0,4,22823.0,29445.0,...,25.532328,11.440140,13.089287,0.005493,192.387931,594.771180,2.830172,0.185048,3261.431034,34807.978894


In [27]:
def dist2CatV2(characteristics):
    #characteristics is a list of strings. Each string in the list is a name of a characteristic
    for ch in characteristics:
        # 1. expand
        #cars[ch+'Avg'] = cars['categoryName'].map(lambda x: featuresAvg[ch][x])
        cars[ch+'Avg2'] = (cars[ch+'Avg']*cars['categoryCount'] - cars[ch])/(cars['categoryCount']-1)
        # 2. difference
        cars[ch+'Dist2'] = cars[ch]-cars[ch+'Avg2']
        # 3. square
        cars[ch+'Dist2'] = cars[ch+'Dist2'].map(lambda x: x*x)

dist2CatV2(characteristics)
cars

Unnamed: 0,car,year,firm_id,firm_name,division,model,hybrid,segm1,Quantity,Price,...,mpg_combinedAvg2,mpg_combinedDist2,footprintAvg2,footprintDist2,hpAvg2,hpDist2,dispAvg2,dispDist2,weightAvg2,weightDist2
0,0,2005,3,HONDA,Acura,MDX,0,39,57948.0,36970.0,...,19.320714,0.143858,14.129330,0.160456,228.614286,1323.920204,3.717143,0.047151,4311.557143,19444.310408
1,0,2005,8,BMW,BMW,X3,0,39,30769.0,30995.0,...,19.314286,0.698418,14.149499,1.063783,229.771429,2095.023673,3.731429,1.516416,4317.985714,100479.943061
2,0,2005,8,BMW,BMW,X5,0,39,37598.0,42395.0,...,19.349286,2.720143,14.143490,0.365783,229.185714,17.520204,3.724286,0.524590,4308.685714,117864.698776
3,0,2005,19,GM,Buick,Rainier,0,34,15271.0,35765.0,...,19.341429,1.191216,14.128581,0.205916,228.471429,2164.907959,3.707143,0.242908,4311.685714,16981.813061
4,0,2005,19,GM,Buick,Rendezvous,0,38,60589.0,27270.0,...,19.285714,8.204133,14.140809,0.171735,229.757143,2003.201837,3.718571,0.101488,4317.657143,86234.517551
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212,1,2005,7,VOLKS,volkswagen,passat,0,3,49233.0,24955.0,...,25.536522,0.236703,13.092470,0.136331,192.582609,509.974216,2.839130,1.079792,3260.034783,26232.731645
213,1,2005,6,VOLVO,volvo,S40,0,3,24241.0,23945.0,...,25.553043,5.774618,13.096496,0.699341,192.600000,605.160000,2.833913,0.188281,3262.973913,32031.661550
214,1,2005,6,VOLVO,volvo,s60,0,3,24695.0,27920.0,...,25.561739,11.639964,13.091853,0.088597,192.600000,605.160000,2.833913,0.188281,3257.947826,163258.159244
215,1,2005,6,VOLVO,volvo,v70 c70 s70,0,4,22823.0,29445.0,...,25.561739,11.639964,13.088642,0.005589,192.600000,605.160000,2.833913,0.188281,3259.808696,35415.967032
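
The two helpers above can also be written without `map` and without mutating the global `cars` frame. This sketch uses `groupby(...).transform`. The function name `add_distance_ivs` and the toy data are hypothetical, while the output column suffixes follow the notebook's `Avg`, `Dist`, `Avg2`, `Dist2` convention.

```python
import pandas as pd

# Toy data with made-up horsepower values.
toy = pd.DataFrame({
    "categoryName": ["SUV", "SUV", "SUV", "minivans", "minivans"],
    "hp": [200.0, 260.0, 230.0, 180.0, 220.0],
})

def add_distance_ivs(df, characteristics, group_col="categoryName"):
    """Return a copy of df with squared distances to the category mean (Dist)
    and to the leave-one-out category mean (Dist2) for each characteristic."""
    out = df.copy()
    grouped = df.groupby(group_col)
    for ch in characteristics:
        # Per-row category sum and count, aligned with the original index.
        group_sum = grouped[ch].transform("sum")
        group_n = grouped[ch].transform("count")
        out[ch + "Avg"] = group_sum / group_n
        out[ch + "Dist"] = (out[ch] - out[ch + "Avg"]) ** 2
        # Leave-one-out mean: average over the other products in the category.
        out[ch + "Avg2"] = (group_sum - out[ch]) / (group_n - 1)
        out[ch + "Dist2"] = (out[ch] - out[ch + "Avg2"]) ** 2
    return out

result = add_distance_ivs(toy, ["hp"])
print(result)
```

A category with a single product has no competitors, so its leave-one-out mean divides by zero; such rows would need separate handling.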


In [28]:
firstStageV1 = smf.ols(formula='Price ~  hybrid + disp + mpg_combined + footprint + dispDist + mpg_combinedDist + footprintDist',data=cars).fit()

print(firstStageV1.summary())

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.402
Model:                            OLS   Adj. R-squared:                  0.382
Method:                 Least Squares   F-statistic:                     20.06
Date:                Sun, 24 Apr 2022   Prob (F-statistic):           1.76e-20
Time:                        09:31:51   Log-Likelihood:                -2321.3
No. Observations:                 217   AIC:                             4659.
Df Residuals:                     209   BIC:                             4686.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         4.987e+04   1.33e+04  

2. Estimate the first stage regression and save the predicted values.

3. What is the $R^2$ of the first stage regression? Is it high enough to believe that the instrumental variables you included are relevant for predicting the price?

In [30]:
# (verify R^2 > 0.1, at least some of the IV's are significant)
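
The check described in the comment above can be made concrete. The sketch below uses synthetic data and a hand-rolled numpy least-squares fit so the arithmetic is visible; the variable names and the data-generating numbers are assumptions. For a fitted statsmodels result such as `firstStageV1`, the same quantities are available as `firstStageV1.rsquared` and `firstStageV1.fittedvalues`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 217

# Synthetic stand-ins: one included feature and one excluded instrument.
feature = rng.normal(3.0, 1.0, n)
instrument = rng.chisquare(1, n)
price = 20000 + 3000 * feature + 1500 * instrument + rng.normal(0, 2000, n)

def ols(y, columns):
    """Least-squares fit with an intercept; returns coefficients, fitted values, R^2."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta, fitted, 1.0 - ss_res / ss_tot

# Full first stage (feature + instrument) and restricted fit (feature only).
_, price_hat, r2_full = ols(price, [feature, instrument])
_, _, r2_restricted = ols(price, [feature])

# F statistic for excluding the instrument(s): q restrictions, k parameters.
q, k = 1, 3
f_stat = ((r2_full - r2_restricted) / q) / ((1.0 - r2_full) / (n - k))

print(f"R^2 = {r2_full:.3f}, instrument F = {f_stat:.1f}")
```

The overall $R^2$ mixes the contribution of the included features with that of the instruments, which is why the F statistic on the instruments alone is often reported next to it; a value above roughly 10 is a commonly quoted rule of thumb for relevance.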

### 3 Second Stage regression
1. Create a correlation matrix for the features of the cars that you contemplate using in the second stage regression. If you see very high correlation coefficients (e.g. above 0.75 in absolute value). Comment on
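
One way to build the correlation matrix and list the pairs above the 0.75 threshold is sketched below. The data are synthetic, with `hp` deliberately constructed to track `weight`; only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
weight = rng.normal(3500, 500, n)
toy = pd.DataFrame({
    "weight": weight,
    "hp": 0.05 * weight + rng.normal(0, 10, n),  # built to correlate strongly with weight
    "mpg_combined": rng.normal(22, 4, n),         # independent of the other two
})

corr = toy.corr()

threshold = 0.75
# Keep each pair once by masking everything except the upper triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs.abs() > threshold]

print(corr.round(2))
print(high_pairs)
```

Pairs that show up in `high_pairs` are candidates for dropping one of the two features or combining them before the second stage.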

2. Using the predicted value for price from the first stage regression, estimate the second stage regression. Include the features, the dummy variables and the price. (Here it is important to use robust standard errors.)
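
To illustrate what robust standard errors involve, here is a hedged numpy sketch of an OLS fit with the HC1 heteroskedasticity-robust covariance. All data are synthetic, price is expressed in thousands of dollars to keep the matrix well conditioned, and the variable names are assumptions. With statsmodels, the same estimator can be requested through `fit(cov_type='HC1')`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 217

# Synthetic stand-ins (hypothetical values, not the car data).
price_hat = rng.normal(28.0, 6.0, n)           # fitted price from the first stage, in $1000s
hp = rng.normal(200, 50, n)
is_suv = rng.integers(0, 2, n).astype(float)   # a category dummy
noise = rng.normal(0, 1, n) * (1 + is_suv)     # heteroskedastic error
y = -8.0 - 0.1 * price_hat + 0.004 * hp + 0.3 * is_suv + noise  # log share difference

X = np.column_stack([np.ones(n), price_hat, hp, is_suv])
k = X.shape[1]

# OLS coefficients and residuals.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# HC1 sandwich estimator: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1 * n / (n - k)
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = XtX_inv @ meat @ XtX_inv * n / (n - k)
robust_se = np.sqrt(np.diag(cov_hc1))

for name, b, se in zip(["Intercept", "price_hat", "hp", "is_suv"], beta, robust_se):
    print(f"{name:>10}: coef = {b: .6f}, robust se = {se:.6f}")
```

Note that running the second stage by hand on the fitted price treats it as if it were observed, so even robust standard errors from this two-step procedure are only approximate compared with a dedicated two-stage least squares routine.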

In [None]:
# Run second stage regression and look at the coefficients

# From 96 data
# first stage: run regression on price
# X's: hp, mpgcombined, footprint, C('category')
# IV's: hpdist2, mpgcombined2, footprintdist2


# 2nd stage: regress Y_j on the X's and p-hat (the predicted price from the first stage)
# Y_j = log(share_j) - log(share_0); X's: hp, mpg_combined, footprint
# j = toyota camry: alpha-hat = -8.135*10^-5, Pj = 16,758 ; Share = 0.331%, elasticity price = -alpha-hat * Pj = 0.455
# j = BMW: alpha-hat: -2.8713, Pj = 35300; Share = 0.0229% ; elasticity price = -alpha-hat * Pj = 0.0657%

3. Analyze your results. What is the interpretation of the coefficients on the features, the dummy variables and the price?
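
When interpreting the price coefficient, it helps to translate it into elasticities. The sketch below assumes the simple logit demand model, in which the own-price elasticity of product j is alpha * p_j * (1 - s_j) and the cross-price elasticity with respect to product k is -alpha * p_k * s_k. The function names and all numeric inputs are hypothetical, not estimates from this notebook.

```python
def logit_own_price_elasticity(alpha, price, share):
    """Own-price elasticity in the simple logit model: alpha * p_j * (1 - s_j)."""
    return alpha * price * (1.0 - share)

def logit_cross_price_elasticity(alpha, price_k, share_k):
    """Elasticity of another product's share w.r.t. the price of product k: -alpha * p_k * s_k."""
    return -alpha * price_k * share_k

alpha_hat = -1.0e-4   # hypothetical price coefficient (per dollar)
price_j = 20_000.0    # hypothetical price in dollars
share_j = 0.01        # hypothetical market share (1% of households)

own = logit_own_price_elasticity(alpha_hat, price_j, share_j)
cross = logit_cross_price_elasticity(alpha_hat, price_j, share_j)

print(f"own-price elasticity:   {own:.3f}")
print(f"cross-price elasticity: {cross:.3f}")
```

Because shares in this market are small, the own-price elasticity is close to alpha * p_j, so under this model more expensive cars mechanically appear more price elastic.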