# PROBLEM SET
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. 
Certain attributes of each product and store have also been defined. The aim is to build a predictive model that 
estimates the sales of each product at a particular store.
Using this model, BigMart will try to understand which properties of products and stores play a key role in 
increasing sales.

In [1]:
import pandas as pd
import numpy as np

#Read files:
train = pd.read_csv("TrainFile.csv")
test = pd.read_csv("Test_u94Q5KV.csv")

In [2]:
train['source']='train'
test['source']='test'
data = pd.concat([train, test],ignore_index=True)
print(train.shape, test.shape, data.shape)

(8523, 13) (5681, 12) (14204, 13)


We will explore the problem in the following stages:

- Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
- Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
- Data Cleaning – imputing missing values in the data and checking for outliers
- Feature Engineering – modifying existing variables and creating new ones for analysis
- Model Building – making predictive models on the data

In [3]:
#Count the missing values in each column
data.isnull().sum()

Item_Fat_Content                0
Item_Identifier                 0
Item_MRP                        0
Item_Outlet_Sales            5681
Item_Type                       0
Item_Visibility                 0
Item_Weight                  2439
Outlet_Establishment_Year       0
Outlet_Identifier               0
Outlet_Location_Type            0
Outlet_Size                  4016
Outlet_Type                     0
source                          0
dtype: int64

In [4]:
#Filter categorical variables
categorical_columns = [x for x in data.dtypes.index if data.dtypes[x]=='object']
#Exclude ID cols and source:
categorical_columns = [x for x in categorical_columns if x not in ['Item_Identifier','Outlet_Identifier','source']]
#Print frequency of categories
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' %col)
    print(data[col].value_counts())


Frequency of Categories for variable Item_Fat_Content
Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64

Frequency of Categories for variable Item_Type
Fruits and Vegetables    2013
Snack Foods              1989
Household                1548
Frozen Foods             1426
Dairy                    1136
Baking Goods             1086
Canned                   1084
Health and Hygiene        858
Meat                      736
Soft Drinks               726
Breads                    416
Hard Drinks               362
Others                    280
Starchy Foods             269
Breakfast                 186
Seafood                    89
Name: Item_Type, dtype: int64

Frequency of Categories for variable Outlet_Location_Type
Tier 3    5583
Tier 2    4641
Tier 1    3980
Name: Outlet_Location_Type, dtype: int64

Frequency of Categories for variable Outlet_Size
Medium    4655
Small     3980
High      1553
Name: Outlet_Size, dtype: int64


In [5]:
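#Set every Item_Weight to the mean weight of its Item_Identifier, which also fills the missing values
data['Item_Weight'] = data.groupby('Item_Identifier')['Item_Weight'].transform('mean')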

In [6]:
print(data['Item_Weight'].isnull().sum())

0
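
The transform in cell [5] overwrites every weight with its per-item mean. A minimal sketch of a gentler variant, which fills only the missing entries and leaves recorded weights untouched, is shown below. The toy frame and its values are made up for illustration and are not BigMart data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing weight per item.
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "A", "B", "B"],
    "Item_Weight": [9.0, np.nan, 11.0, 5.0, np.nan],
})

# Per-item mean broadcast back to every row; mean() skips NaN values.
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill only the gaps; observed weights keep their original values.
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_mean)

print(toy["Item_Weight"].tolist())
```

If every row of an item were missing, its mean would also be NaN and the gap would remain, so a final `isnull().sum()` check is still worthwhile.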


In [7]:
data['Outlet_Type'].value_counts()

Supermarket Type1    9294
Grocery Store        1805
Supermarket Type3    1559
Supermarket Type2    1546
Name: Outlet_Type, dtype: int64

In [8]:
data.groupby('Outlet_Type')['Outlet_Size'].value_counts()

Outlet_Type        Outlet_Size
Grocery Store      Small           880
Supermarket Type1  Small          3100
                   High           1553
                   Medium         1550
Supermarket Type2  Medium         1546
Supermarket Type3  Medium         1559
Name: Outlet_Size, dtype: int64

In [9]:
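#Most frequent Outlet_Size for each Outlet_Type (value_counts sorts by frequency, so index[0] is the mode)
y = data.groupby('Outlet_Type')['Outlet_Size'].apply(lambda x : x.value_counts().index[0])
print(y)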

Outlet_Type
Grocery Store         Small
Supermarket Type1     Small
Supermarket Type2    Medium
Supermarket Type3    Medium
Name: Outlet_Size, dtype: object


In [10]:
print(data['Outlet_Size'].isnull().sum())

4016


In [11]:
#Get a boolean mask marking the rows where Outlet_Size is missing
miss_bool = data['Outlet_Size'].isnull() 

#Impute each missing size with the mode for that row's Outlet_Type (y, from cell [9])
print('\nOriginal #missing: %d'% sum(miss_bool))
data.loc[miss_bool,'Outlet_Size'] = data.loc[miss_bool,'Outlet_Type'].apply(lambda x: y[x])



Original #missing: 4016


In [12]:
print(data['Outlet_Size'])

0        Medium
1        Medium
2        Medium
3         Small
4          High
5        Medium
6          High
7        Medium
8         Small
9         Small
10       Medium
11        Small
12       Medium
13        Small
14         High
15        Small
16       Medium
17       Medium
18       Medium
19        Small
20         High
21       Medium
22        Small
23        Small
24        Small
25        Small
26        Small
27         High
28        Small
29        Small
          ...  
14174      High
14175     Small
14176     Small
14177      High
14178    Medium
14179    Medium
14180     Small
14181      High
14182    Medium
14183      High
14184    Medium
14185     Small
14186    Medium
14187    Medium
14188    Medium
14189     Small
14190     Small
14191     Small
14192    Medium
14193     Small
14194    Medium
14195    Medium
14196    Medium
14197     Small
14198    Medium
14199     Small
14200    Medium
14201     Small
14202     Small
14203     Small
Name: Outlet_Size, Lengt

In [13]:
print(sum(data['Outlet_Size'].isnull()))

0
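
The mode-based imputation of cells [9] to [13] can be sketched compactly with `map` and `fillna`. This is an illustrative variant on a made-up toy frame, not the notebook's exact code; the outlet types and sizes below are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing size per outlet type.
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery", "Grocery", "Grocery", "Super", "Super", "Super"],
    "Outlet_Size": ["Small", "Small", np.nan, "Medium", np.nan, "Medium"],
})

# Most frequent non-missing size per outlet type; mode() ignores NaN.
mode_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])

# Look up the mode for each row's type and use it only where the size is missing.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(mode_by_type))

print(toy["Outlet_Size"].tolist())
```

Because `fillna` touches only missing entries, no boolean mask is needed.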


In [14]:
data.head()

Unnamed: 0,Item_Fat_Content,Item_Identifier,Item_MRP,Item_Outlet_Sales,Item_Type,Item_Visibility,Item_Weight,Outlet_Establishment_Year,Outlet_Identifier,Outlet_Location_Type,Outlet_Size,Outlet_Type,source
0,Low Fat,FDA15,249.8092,3735.138,Dairy,0.016047,9.3,1999,OUT049,Tier 1,Medium,Supermarket Type1,train
1,Regular,DRC01,48.2692,443.4228,Soft Drinks,0.019278,5.92,2009,OUT018,Tier 3,Medium,Supermarket Type2,train
2,Low Fat,FDN15,141.618,2097.27,Meat,0.01676,17.5,1999,OUT049,Tier 1,Medium,Supermarket Type1,train
3,Regular,FDX07,182.095,732.38,Fruits and Vegetables,0.0,19.2,1998,OUT010,Tier 3,Small,Grocery Store,train
4,Low Fat,NCD19,53.8614,994.7052,Household,0.0,8.93,1987,OUT013,Tier 3,High,Supermarket Type1,train


In [98]:
data.pivot_table(values='Item_Outlet_Sales',index='Outlet_Type')

Unnamed: 0_level_0,Item_Outlet_Sales
Outlet_Type,Unnamed: 1_level_1
Grocery Store,339.8285
Supermarket Type1,2316.181148
Supermarket Type2,1995.498739
Supermarket Type3,3694.038558


In [15]:
y = data.pivot_table(values = 'Item_Visibility',index = 'Item_Identifier')
print(y)

                 Item_Visibility
Item_Identifier                 
DRA12                   0.034938
DRA24                   0.045646
DRA59                   0.133384
DRB01                   0.079736
DRB13                   0.006799
DRB24                   0.020596
DRB25                   0.079407
DRB48                   0.023973
DRC01                   0.020653
DRC12                   0.037862
DRC13                   0.028408
DRC24                   0.026913
DRC25                   0.047354
DRC27                   0.066423
DRC36                   0.046932
DRC49                   0.070950
DRD01                   0.066330
DRD12                   0.074150
DRD13                   0.049125
DRD15                   0.064930
DRD24                   0.035205
DRD25                   0.082385
DRD27                   0.020545
DRD37                   0.013352
DRD49                   0.167987
DRD60                   0.040369
DRE01                   0.179808
DRE03                   0.026061
DRE12     

In [16]:
#Mean visibility of each item across all outlets
z = data.groupby('Item_Identifier')['Item_Visibility'].mean()
print(z)

Item_Identifier
DRA12    0.034938
DRA24    0.045646
DRA59    0.133384
DRB01    0.079736
DRB13    0.006799
DRB24    0.020596
DRB25    0.079407
DRB48    0.023973
DRC01    0.020653
DRC12    0.037862
DRC13    0.028408
DRC24    0.026913
DRC25    0.047354
DRC27    0.066423
DRC36    0.046932
DRC49    0.070950
DRD01    0.066330
DRD12    0.074150
DRD13    0.049125
DRD15    0.064930
DRD24    0.035205
DRD25    0.082385
DRD27    0.020545
DRD37    0.013352
DRD49    0.167987
DRD60    0.040369
DRE01    0.179808
DRE03    0.026061
DRE12    0.061981
DRE13    0.031673
           ...   
NCX05    0.110962
NCX06    0.017934
NCX17    0.113709
NCX18    0.008293
NCX29    0.101920
NCX30    0.025977
NCX41    0.017291
NCX42    0.006482
NCX53    0.014409
NCX54    0.051698
NCY05    0.059645
NCY06    0.065816
NCY17    0.126951
NCY18    0.033510
NCY29    0.088295
NCY30    0.028140
NCY41    0.086582
NCY42    0.016440
NCY53    0.056916
NCY54    0.191145
NCZ05    0.063030
NCZ06    0.102096
NCZ17    0.076568
NCZ18    0.1

In [17]:
#Treat a visibility of exactly 0 as a missing value and mark those rows
miss_bool = (data['Item_Visibility']==0)
print(miss_bool)

0        False
1        False
2        False
3         True
4         True
5         True
6        False
7        False
8        False
9        False
10        True
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
14174    False
14175    False
14176    False
14177     True
14178    False
14179    False
14180    False
14181    False
14182    False
14183    False
14184     True
14185    False
14186    False
14187    False
14188    False
14189    False
14190    False
14191    False
14192    False
14193    False
14194    False
14195    False
14196    False
14197    False
14198    False
14199    False
14200    False
14201    False
14202     True
14203    False
Name: Item_Visibility, Length: 14204, dtype: bool


In [18]:
#Replace the zero values with the item's mean visibility (z, from cell [16])
data.loc[miss_bool,'Item_Visibility'] = data.loc[miss_bool,'Item_Identifier'].apply(lambda x : z[x])

In [19]:
print('Number of 0 values after modification: %d'%sum(data['Item_Visibility'] == 0))

Number of 0 values after modification: 0
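
Note that `z` was computed while the zeros were still in the column, so each item's mean is pulled down by them. A hedged alternative, sketched here on a made-up toy frame, converts zeros to NaN first so that the per-item mean uses only the non-zero readings. This is a swapped-in variant, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); 0.0 stands for an unrecorded visibility.
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "A", "B", "B"],
    "Item_Visibility": [0.02, 0.0, 0.04, 0.10, 0.0],
})

# Treat zero visibility as missing so it does not drag the mean down.
vis = toy["Item_Visibility"].replace(0, np.nan)

# Per-item mean of the non-zero values, aligned to every row.
item_mean = vis.groupby(toy["Item_Identifier"]).transform("mean")

# Fill only the former zeros with that mean.
toy["Item_Visibility"] = vis.fillna(item_mean)

print(toy["Item_Visibility"].round(3).tolist())
```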


In [20]:
#Ratio of each row's visibility to the item's mean visibility across outlets (z, from cell [16])
data['Item_Visibility_Meanratio'] = data['Item_Visibility'] / data['Item_Identifier'].map(z)

In [21]:
print(data['Item_Visibility_Meanratio'])

0        0.931078
1        0.933420
2        0.960069
3        1.000000
4        1.000000
5        1.000000
6        1.497197
7        0.870493
8        0.924160
9        0.963983
10       1.000000
11       1.036695
12       1.026360
13       0.922290
14       1.171331
15       1.028073
16       1.003140
17       1.029671
18       0.870493
19       0.922116
20       1.139904
21       0.954309
22       0.862894
23       1.531537
24       0.929633
25       0.927507
26       1.060235
27       1.035278
28       1.444581
29       1.679003
           ...   
14174    0.998238
14175    1.038721
14176    0.874563
14177    1.000000
14178    1.019804
14179    1.326824
14180    0.999070
14181    0.874001
14182    1.291142
14183    0.920760
14184    1.000000
14185    0.926739
14186    1.031682
14187    0.962406
14188    0.925131
14189    0.929633
14190    0.922290
14191    1.735564
14192    0.925131
14193    1.033309
14194    0.931078
14195    0.876089
14196    1.031964
14197    1.285095
14198    0

In [23]:
data.head(10)

Unnamed: 0,Item_Fat_Content,Item_Identifier,Item_MRP,Item_Outlet_Sales,Item_Type,Item_Visibility,Item_Weight,Outlet_Establishment_Year,Outlet_Identifier,Outlet_Location_Type,Outlet_Size,Outlet_Type,source,Item_Visibility_Meanratio
0,Low Fat,FDA15,249.8092,3735.138,Dairy,0.016047,9.3,1999,OUT049,Tier 1,Medium,Supermarket Type1,train,0.931078
1,Regular,DRC01,48.2692,443.4228,Soft Drinks,0.019278,5.92,2009,OUT018,Tier 3,Medium,Supermarket Type2,train,0.93342
2,Low Fat,FDN15,141.618,2097.27,Meat,0.01676,17.5,1999,OUT049,Tier 1,Medium,Supermarket Type1,train,0.960069
3,Regular,FDX07,182.095,732.38,Fruits and Vegetables,0.017834,19.2,1998,OUT010,Tier 3,Small,Grocery Store,train,1.0
4,Low Fat,NCD19,53.8614,994.7052,Household,0.00978,8.93,1987,OUT013,Tier 3,High,Supermarket Type1,train,1.0
5,Regular,FDP36,51.4008,556.6088,Baking Goods,0.057059,10.395,2009,OUT018,Tier 3,Medium,Supermarket Type2,train,1.0
6,Regular,FDO10,57.6588,343.5528,Snack Foods,0.012741,13.65,1987,OUT013,Tier 3,High,Supermarket Type1,train,1.497197
7,Low Fat,FDP10,107.7622,4022.7636,Snack Foods,0.12747,19.0,1985,OUT027,Tier 3,Medium,Supermarket Type3,train,0.870493
8,Regular,FDH17,96.9726,1076.5986,Frozen Foods,0.016687,16.2,2002,OUT045,Tier 2,Small,Supermarket Type1,train,0.92416
9,Regular,FDU28,187.8214,4710.535,Frozen Foods,0.09445,19.2,2007,OUT017,Tier 2,Small,Supermarket Type1,train,0.963983
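
A small worked example of the mean-ratio feature, using made-up numbers, may help with reading the column: a value above 1 means the item has a higher visibility in that outlet than its own average across outlets, and a value below 1 means a lower one.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two items, each stocked in two outlets.
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B"],
    "Item_Visibility": [0.02, 0.06, 0.10, 0.30],
})

# Mean visibility of each item, repeated on every row of that item.
item_mean = toy.groupby("Item_Identifier")["Item_Visibility"].transform("mean")

# Row visibility relative to the item's own average.
toy["Item_Visibility_Meanratio"] = toy["Item_Visibility"] / item_mean

print(toy["Item_Visibility_Meanratio"].round(2).tolist())
```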


In [24]:
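#Keep only the first two characters of Item_Identifier (FD, DR or NC).
#Note: this overwrites the unique item IDs, so they are no longer available afterwards.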
data['Item_Identifier'] = data['Item_Identifier'].apply(lambda x: x[0:2])
print(data['Item_Identifier'])

0        FD
1        DR
2        FD
3        FD
4        NC
5        FD
6        FD
7        FD
8        FD
9        FD
10       FD
11       FD
12       FD
13       FD
14       FD
15       FD
16       NC
17       FD
18       DR
19       FD
20       FD
21       FD
22       NC
23       FD
24       FD
25       NC
26       FD
27       DR
28       FD
29       FD
         ..
14174    FD
14175    FD
14176    FD
14177    FD
14178    FD
14179    FD
14180    FD
14181    FD
14182    DR
14183    FD
14184    DR
14185    FD
14186    DR
14187    DR
14188    DR
14189    FD
14190    FD
14191    FD
14192    FD
14193    FD
14194    FD
14195    NC
14196    FD
14197    DR
14198    FD
14199    FD
14200    FD
14201    NC
14202    FD
14203    FD
Name: Item_Identifier, Length: 14204, dtype: object


In [25]:
#Map the two-letter codes to three broad item categories
data['Item_Type_Combined'] = data['Item_Identifier'].map({'FD':'Food','NC':'Non Consumable','DR':'Drinks'})

In [26]:
data.head(10)

Unnamed: 0,Item_Fat_Content,Item_Identifier,Item_MRP,Item_Outlet_Sales,Item_Type,Item_Visibility,Item_Weight,Outlet_Establishment_Year,Outlet_Identifier,Outlet_Location_Type,Outlet_Size,Outlet_Type,source,Item_Visibility_Meanratio,Item_Type_Combined
0,Low Fat,FD,249.8092,3735.138,Dairy,0.016047,9.3,1999,OUT049,Tier 1,Medium,Supermarket Type1,train,0.931078,Food
1,Regular,DR,48.2692,443.4228,Soft Drinks,0.019278,5.92,2009,OUT018,Tier 3,Medium,Supermarket Type2,train,0.93342,Drinks
2,Low Fat,FD,141.618,2097.27,Meat,0.01676,17.5,1999,OUT049,Tier 1,Medium,Supermarket Type1,train,0.960069,Food
3,Regular,FD,182.095,732.38,Fruits and Vegetables,0.017834,19.2,1998,OUT010,Tier 3,Small,Grocery Store,train,1.0,Food
4,Low Fat,NC,53.8614,994.7052,Household,0.00978,8.93,1987,OUT013,Tier 3,High,Supermarket Type1,train,1.0,Non Consumable
5,Regular,FD,51.4008,556.6088,Baking Goods,0.057059,10.395,2009,OUT018,Tier 3,Medium,Supermarket Type2,train,1.0,Food
6,Regular,FD,57.6588,343.5528,Snack Foods,0.012741,13.65,1987,OUT013,Tier 3,High,Supermarket Type1,train,1.497197,Food
7,Low Fat,FD,107.7622,4022.7636,Snack Foods,0.12747,19.0,1985,OUT027,Tier 3,Medium,Supermarket Type3,train,0.870493,Food
8,Regular,FD,96.9726,1076.5986,Frozen Foods,0.016687,16.2,2002,OUT045,Tier 2,Small,Supermarket Type1,train,0.92416,Food
9,Regular,FD,187.8214,4710.535,Frozen Foods,0.09445,19.2,2007,OUT017,Tier 2,Small,Supermarket Type1,train,0.963983,Food


In [27]:
data.pivot_table(values='Item_Outlet_Sales',index='Item_Type_Combined')

Unnamed: 0_level_0,Item_Outlet_Sales
Item_Type_Combined,Unnamed: 1_level_1
Drinks,1997.333337
Food,2215.354223
Non Consumable,2142.721364


In [28]:
data.pivot_table(values='Item_Outlet_Sales',index='Item_Identifier')

Unnamed: 0_level_0,Item_Outlet_Sales
Item_Identifier,Unnamed: 1_level_1
DR,1997.333337
FD,2215.354223
NC,2142.721364


In [29]:
#Years of operation as of 2013, the year of the sales data
data['Outlet_Years'] = 2013 - data['Outlet_Establishment_Year']

In [30]:
data.head(10)

Unnamed: 0,Item_Fat_Content,Item_Identifier,Item_MRP,Item_Outlet_Sales,Item_Type,Item_Visibility,Item_Weight,Outlet_Establishment_Year,Outlet_Identifier,Outlet_Location_Type,Outlet_Size,Outlet_Type,source,Item_Visibility_Meanratio,Item_Type_Combined,Outlet_Years
0,Low Fat,FD,249.8092,3735.138,Dairy,0.016047,9.3,1999,OUT049,Tier 1,Medium,Supermarket Type1,train,0.931078,Food,14
1,Regular,DR,48.2692,443.4228,Soft Drinks,0.019278,5.92,2009,OUT018,Tier 3,Medium,Supermarket Type2,train,0.93342,Drinks,4
2,Low Fat,FD,141.618,2097.27,Meat,0.01676,17.5,1999,OUT049,Tier 1,Medium,Supermarket Type1,train,0.960069,Food,14
3,Regular,FD,182.095,732.38,Fruits and Vegetables,0.017834,19.2,1998,OUT010,Tier 3,Small,Grocery Store,train,1.0,Food,15
4,Low Fat,NC,53.8614,994.7052,Household,0.00978,8.93,1987,OUT013,Tier 3,High,Supermarket Type1,train,1.0,Non Consumable,26
5,Regular,FD,51.4008,556.6088,Baking Goods,0.057059,10.395,2009,OUT018,Tier 3,Medium,Supermarket Type2,train,1.0,Food,4
6,Regular,FD,57.6588,343.5528,Snack Foods,0.012741,13.65,1987,OUT013,Tier 3,High,Supermarket Type1,train,1.497197,Food,26
7,Low Fat,FD,107.7622,4022.7636,Snack Foods,0.12747,19.0,1985,OUT027,Tier 3,Medium,Supermarket Type3,train,0.870493,Food,28
8,Regular,FD,96.9726,1076.5986,Frozen Foods,0.016687,16.2,2002,OUT045,Tier 2,Small,Supermarket Type1,train,0.92416,Food,11
9,Regular,FD,187.8214,4710.535,Frozen Foods,0.09445,19.2,2007,OUT017,Tier 2,Small,Supermarket Type1,train,0.963983,Food,6


In [31]:
data['Outlet_Years'].describe()

count    14204.000000
mean        15.169319
std          8.371664
min          4.000000
25%          9.000000
50%         14.000000
75%         26.000000
max         28.000000
Name: Outlet_Years, dtype: float64

In [32]:
data.pivot_table(values='Item_Outlet_Sales',index='Outlet_Years')

Unnamed: 0_level_0,Item_Outlet_Sales
Outlet_Years,Unnamed: 1_level_1
4,1995.498739
6,2340.675263
9,2438.841866
11,2192.384798
14,2348.354635
15,339.351662
16,2277.844267
26,2298.995256
28,2483.677474


In [33]:
print("Original Categories")
data['Item_Fat_Content'].value_counts()

Original Categories


Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64

In [34]:
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'LF' : 'Low Fat','reg' : 'Regular','low fat' : 'Low Fat'})
data['Item_Fat_Content'].value_counts()

Low Fat    9185
Regular    5019
Name: Item_Fat_Content, dtype: int64

In [35]:
#Fat content is not meaningful for non-consumable items, so give them their own category
data.loc[data['Item_Identifier']=="NC",'Item_Fat_Content'] = "Non-Edible"

In [36]:
data['Item_Fat_Content'].value_counts()

Low Fat       6499
Regular       5019
Non-Edible    2686
Name: Item_Fat_Content, dtype: int64

In [37]:
data.dtypes

Item_Fat_Content              object
Item_Identifier               object
Item_MRP                     float64
Item_Outlet_Sales            float64
Item_Type                     object
Item_Visibility              float64
Item_Weight                  float64
Outlet_Establishment_Year      int64
Outlet_Identifier             object
Outlet_Location_Type          object
Outlet_Size                   object
Outlet_Type                   object
source                        object
Item_Visibility_Meanratio    float64
Item_Type_Combined            object
Outlet_Years                   int64
dtype: object

In [41]:
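from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
#Encode the outlet ID in a new column so that Outlet_Identifier itself is kept for the submission file
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])
var_mod = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Outlet_Type','Item_Type_Combined','Outlet']

#Convert each categorical column to integer codes
for i in var_mod:
    data[i] = le.fit_transform(data[i])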

In [42]:
# one hot encoding
data = pd.get_dummies(data,columns = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Outlet_Type','Item_Type_Combined','Outlet'])
print(data.dtypes)

Item_Identifier               object
Item_MRP                     float64
Item_Outlet_Sales            float64
Item_Type                     object
Item_Visibility              float64
Item_Weight                  float64
Outlet_Establishment_Year      int64
Outlet_Identifier             object
source                        object
Item_Visibility_Meanratio    float64
Outlet_Years                   int64
Item_Fat_Content_0             uint8
Item_Fat_Content_1             uint8
Item_Fat_Content_2             uint8
Outlet_Location_Type_0         uint8
Outlet_Location_Type_1         uint8
Outlet_Location_Type_2         uint8
Outlet_Size_0                  uint8
Outlet_Size_1                  uint8
Outlet_Size_2                  uint8
Outlet_Type_0                  uint8
Outlet_Type_1                  uint8
Outlet_Type_2                  uint8
Outlet_Type_3                  uint8
Item_Type_Combined_0           uint8
Item_Type_Combined_1           uint8
Item_Type_Combined_2           uint8
O

In [44]:
data[['Item_Fat_Content_0','Item_Fat_Content_1','Item_Fat_Content_2']].head(10)

Unnamed: 0,Item_Fat_Content_0,Item_Fat_Content_1,Item_Fat_Content_2
0,1,0,0
1,0,0,1
2,1,0,0
3,0,0,1
4,0,1,0
5,0,0,1
6,0,0,1
7,1,0,0
8,0,0,1
9,0,0,1
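
`LabelEncoder` assigns integer codes in sorted label order, so the columns `Item_Fat_Content_0`, `_1` and `_2` above should correspond to Low Fat, Non-Edible and Regular. As a sketch of an alternative, `pd.get_dummies` can be applied to the string column directly, which keeps the category names in the column headers. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numeric column.
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "Regular", "Non-Edible", "Regular"],
    "Item_MRP": [249.8, 48.3, 53.9, 51.4],
})

# One indicator column per category; untouched columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["Item_Fat_Content"])

print(encoded.columns.tolist())
```

Skipping the integer-encoding step makes the later coefficient plot easier to read, at the cost of longer column names.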


In [45]:
#Drop the columns that have been replaced by engineered features

data.drop(['Item_Type','Outlet_Establishment_Year'],axis = 1,inplace = True)

#Split back into train and test; .copy() avoids SettingWithCopyWarning on the drops below
train = data.loc[data['source'] =='train'].copy()
test = data.loc[data['source']=='test'].copy()

test.drop(['Item_Outlet_Sales','source'],axis = 1,inplace = True)
train.drop(['source'],axis = 1,inplace=True)

#Export files as modified version

train.to_csv("train_modified.csv",index = False)
test.to_csv("test_modified.csv",index = False)


In [49]:
import sys
print(sys.path[0])

/Users/ishashah/Applications/anaconda/lib/python36.zip


In [48]:
if __name__ == '__main__':
    # Remove the CWD from sys.path while we load stuff.
    # This is added back by InteractiveShellApp.init_path()
    if sys.path[0] == '':
        del sys.path[0]

In [51]:
mean_sales = train['Item_Outlet_Sales'].mean()

In [52]:
#Baseline: predict the mean training sales for every test row
#.copy() gives an independent frame, so adding a column raises no SettingWithCopyWarning
base1 = test[['Item_Identifier','Outlet_Identifier']].copy()
base1['Item_Outlet_Sales'] = mean_sales

#export
base1.to_csv("algo.csv",index=False)
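
The mean-sales baseline gives a reference error that any real model should beat. A minimal sketch of scoring such a constant prediction with RMSE follows; the sales figures are made up and only stand in for `train['Item_Outlet_Sales']`.

```python
import numpy as np

# Hypothetical sales figures standing in for the training target.
sales = np.array([100.0, 200.0, 300.0, 400.0])

# The baseline predicts the same value, the training mean, for every row.
baseline = np.full_like(sales, sales.mean())

# RMSE of the constant prediction on the data it was derived from.
rmse = np.sqrt(np.mean((sales - baseline) ** 2))

print(round(rmse, 2))
```

On the data it was computed from, this RMSE equals the population standard deviation of the target, which is a convenient yardstick for the model reports later on.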
  


In [76]:
target = 'Item_Outlet_Sales'
IDcol = ['Item_Identifier','Outlet_Identifier']

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

def modelfit(alg,dtrain,dtest,predictors,target,IDcol,filename):
    #Fit the algorithm on the data
    alg.fit(dtrain[predictors],dtrain[target])
    
    #predict the training set
    dtrain_predictions = alg.predict(dtrain[predictors])
    
    #perform 10-fold cross validation; scikit-learn reports MSE negated so that higher is better
    cv_score = cross_val_score(alg,dtrain[predictors],dtrain[target],cv=10,
                               scoring='neg_mean_squared_error')
    
    #convert the negated MSE of each fold to RMSE
    cv_score = np.sqrt(np.abs(cv_score))
    
    #print model report
    print("Model Report")
    print("RMSE : %.4g" % np.sqrt(mean_squared_error(dtrain[target].values,dtrain_predictions)))
    print("CV Score : Mean - %.4g, Std - %.4g, Min - %.4g, Max - %.4g" % (np.mean(cv_score), np.std(cv_score), 
                                                                          np.min(cv_score), np.max(cv_score)))
    #Predict on a copy of the testing data so the caller's frame is left unchanged
    dtest = dtest.copy()
    dtest[target] = alg.predict(dtest[predictors])
          
    #Export Submission File (build a new list instead of appending to the caller's IDcol)
    submission_cols = IDcol + [target]
    submission = pd.DataFrame({x : dtest[x] for x in submission_cols})
    submission.to_csv(filename,index=False)

In [77]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
predictors = [x for x in train.columns if x not in [target]+IDcol]
# print(predictors)
# Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2
alg1 = LinearRegression(normalize=True)
modelfit(alg1, train, test, predictors, target, IDcol, 'alg1.csv')
coef1 = pd.Series(alg1.coef_, predictors).sort_values()
coef1.plot(kind='bar', title='Model Coefficients')

Model Report
RMSE : 1127
CV Score : Mean - 1130, Std - 16.23, Min - 1110, Max - 1163
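
The CV figures in the report are per-fold RMSE values. A self-contained sketch of how they can be obtained with `cross_val_score` is given below, on synthetic data generated only for illustration (the coefficients and noise level are assumptions).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn maximises scores, so MSE is returned with a negative sign.
neg_mse = cross_val_score(LinearRegression(), X, y, cv=10,
                          scoring="neg_mean_squared_error")

# Flip the sign before taking the square root to get one RMSE per fold.
rmse_per_fold = np.sqrt(-neg_mse)

print("CV RMSE: mean %.3f, std %.3f" % (rmse_per_fold.mean(), rmse_per_fold.std()))
```

A small spread across folds, as in the report above, suggests that the error estimate does not depend strongly on how the data is split.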


<matplotlib.axes._subplots.AxesSubplot at 0x118977668>

In [75]:
predictors = [x for x in train.columns if x not in [target]+IDcol]
# Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2
alg2 = Ridge(alpha=0.05,normalize=True)
modelfit(alg2, train, test, predictors, target, IDcol, 'alg2.csv')
coef2 = pd.Series(alg2.coef_, predictors).sort_values()
coef2.plot(kind='bar', title='Model Coefficients')

Model Report
RMSE : 1129
CV Score : Mean - 1131, Std - 17.79, Min - 1110, Max - 1167
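
On scikit-learn versions without the `normalize` argument, one commonly used substitute is to scale the features in a pipeline before the ridge step. The sketch below uses synthetic data and is not numerically equivalent to `normalize=True` (which scaled by the l2 norm rather than the standard deviation), so `alpha` would need re-tuning.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([1.0, 100.0])
y = 3.0 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Standardise each feature, then fit ridge regression on the scaled values.
model = make_pipeline(StandardScaler(), Ridge(alpha=0.05))
model.fit(X, y)

# Coefficients are expressed per standard deviation of each feature.
coefs = model.named_steps["ridge"].coef_

print(coefs.shape, round(model.score(X, y), 3))
```

Keeping the scaler inside the pipeline means it is re-fitted on each training fold during cross validation, which avoids leaking statistics from the held-out fold.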


<matplotlib.axes._subplots.AxesSubplot at 0x118977668>