## Creating data
There are two core objects in pandas: the **DataFrame** and the **Series**.

**DataFrame**
A DataFrame is a table. It contains an array of individual entries, each of which has a value. Each entry corresponds to a row (or record) and a column.

In [6]:
# DataFrame example
# importing pandas library
import pandas as pd
# building a DataFrame from a dict that maps column names to lists of values
pd.DataFrame({'Yes':['50','21'], 'No':['131','2']})

Unnamed: 0,Yes,No
0,50,131
1,21,2


In [8]:
# DataFrame with string values
pd.DataFrame({'Bob':['I liked it','It was awful'],'Sue':['Pretty good','Bland']})

Unnamed: 0,Bob,Sue
0,I liked it,Pretty good
1,It was awful,Bland


The list of row labels used in a DataFrame is known as the index. We can assign values to it by using an index parameter in our constructor:

In [9]:
pd.DataFrame({'Bob':['I liked it','It was awful'],'Sue':['Pretty good','Bland']},
            index=['Product A','Product B'])

Unnamed: 0,Bob,Sue
Product A,I liked it,Pretty good
Product B,It was awful,Bland


**Series**
A Series is a sequence of data values. If a DataFrame is a table, a Series is a list, and you can create one with nothing more than a list:

In [12]:
# creating a Series from a list
pd.Series(['10','12','3','5'])

0    10
1    12
2     3
3     5
dtype: object

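A Series is, in essence, a single column of a DataFrame. Row labels can be assigned to a Series with the index parameter, just as for a DataFrame. A Series has no column names, only a single overall name, which is set with the name parameter: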

In [14]:
pd.Series(['2016 sales','2015 sales'],name='Product A')

0    2016 sales
1    2015 sales
Name: Product A, dtype: object
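The cell above sets only the name. As a minimal sketch of the index parameter mentioned in the text, the following builds a Series with both row labels and a name. The sales figures and the variable name `sales` are made up for illustration.

```python
import pandas as pd

# Hypothetical sales figures, used only to illustrate the parameters.
sales = pd.Series(
    [30, 35, 40],
    index=['2015 sales', '2016 sales', '2017 sales'],  # row labels
    name='Product A',                                  # overall name of the Series
)

print(sales)
# A value can now be looked up by its row label.
print(sales.loc['2016 sales'])
```

With a labelled index, `loc` looks values up by label, while `iloc` still works by position.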

In [22]:
# reading a CSV file and using its first column as the index
unnamed=pd.read_csv('data/bigmart.csv', index_col=[0])
unnamed.head()

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [25]:
# saving a dataframe to csv
data=pd.read_csv('data/bigmart.csv')
data.to_csv('save.csv')
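By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows this round trip. It uses a small made-up DataFrame and an in-memory buffer instead of the bigmart file, so the variable names here are illustrative only.

```python
import io
import pandas as pd

# Small illustrative DataFrame; not the bigmart data.
df = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(reloaded.columns.tolist())

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)
print(clean.columns.tolist())
```

Passing `index_col=[0]` to `read_csv`, as in the earlier cell, is the other way to handle a file that was saved with its index.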

In [27]:
# indexing in pandas
# using the iloc (position-based) and loc (label-based) indexers
# selecting the first row
data.iloc[0]

Item_Identifier                          FDA15
Item_Weight                                9.3
Item_Fat_Content                       Low Fat
Item_Visibility                      0.0160473
Item_Type                                Dairy
Item_MRP                               249.809
Outlet_Identifier                       OUT049
Outlet_Establishment_Year                 1999
Outlet_Size                             Medium
Outlet_Location_Type                    Tier 1
Outlet_Type                  Supermarket Type1
Item_Outlet_Sales                      3735.14
Name: 0, dtype: object

In [28]:
# getting a column with iloc
data.iloc[:,0]

0       FDA15
1       DRC01
2       FDN15
3       FDX07
4       NCD19
        ...  
8518    FDF22
8519    FDS36
8520    NCJ29
8521    FDN46
8522    DRG01
Name: Item_Identifier, Length: 8523, dtype: object

In [29]:
# selecting the second and third entries
data.iloc[1:3,0]

1    DRC01
2    FDN15
Name: Item_Identifier, dtype: object

In [32]:
# it is also possible to pass a list
data.iloc[[0,1,2],0]

0    FDA15
1    DRC01
2    FDN15
Name: Item_Identifier, dtype: object

In [33]:
# selecting the last five rows
data.iloc[-5:]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.38,Regular,0.046982,Baking Goods,108.157,OUT045,2002,,Tier 2,Supermarket Type1,549.285
8520,NCJ29,10.6,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.21,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976
8522,DRG01,14.8,Low Fat,0.044878,Soft Drinks,75.467,OUT046,1997,Small,Tier 1,Supermarket Type1,765.67


# Label-based selection
Label-based selection uses the loc indexer. In this paradigm, it is the data index value, not its position, that matters.

In [37]:
# getting the first entry in the Item_Identifier column
data.loc[0,'Item_Identifier']

'FDA15'
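One practical difference between the two indexers is how they treat the end of a slice: `iloc` follows the usual Python convention and excludes the stop position, while `loc` includes the stop label. A minimal sketch on a small made-up Series (the name `items` is illustrative):

```python
import pandas as pd

# Four identifiers with the default integer index 0..3.
items = pd.Series(['FDA15', 'DRC01', 'FDN15', 'FDX07'])

by_position = items.iloc[1:3]  # positions 1 and 2; the stop position is excluded
by_label = items.loc[1:3]      # labels 1, 2 and 3; the stop label is included

print(by_position.tolist())
print(by_label.tolist())
```

The same slice therefore returns two rows with `iloc` and three rows with `loc`.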

In [39]:
# Listing specific columns
data.loc[:,['Item_Visibility','Item_MRP','Outlet_Size']]

Unnamed: 0,Item_Visibility,Item_MRP,Outlet_Size
0,0.016047,249.8092,Medium
1,0.019278,48.2692,Medium
2,0.016760,141.6180,Medium
3,0.000000,182.0950,
4,0.000000,53.8614,High
...,...,...,...
8518,0.056783,214.5218,High
8519,0.046982,108.1570,
8520,0.035186,85.1224,Small
8521,0.145221,103.1332,Medium


# Manipulating the Index
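Label-based selection derives its power from the labels in the index. The index is not fixed and can be changed when another column would make a more useful set of labels.

The set_index() method can be used for this. It returns a new DataFrame that uses the given column as its index: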


In [43]:
data.set_index("Outlet_Establishment_Year")

Outlet_Establishment_Year,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
1999,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,Medium,Tier 1,Supermarket Type1,3735.1380
2009,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,Medium,Tier 3,Supermarket Type2,443.4228
1999,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,Medium,Tier 1,Supermarket Type1,2097.2700
1998,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,,Tier 3,Grocery Store,732.3800
1987,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,High,Tier 3,Supermarket Type1,994.7052
...,...,...,...,...,...,...,...,...,...,...,...
1987,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,High,Tier 3,Supermarket Type1,2778.3834
2002,FDS36,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,,Tier 2,Supermarket Type1,549.2850
2004,NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,Small,Tier 2,Supermarket Type1,1193.1136
2009,FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,Medium,Tier 3,Supermarket Type2,1845.5976
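Note that `set_index()` does not modify the DataFrame it is called on, so the result has to be assigned to a variable to be kept. The following sketch shows this on a two-row made-up DataFrame. The names `df`, `indexed` and `restored` are illustrative.

```python
import pandas as pd

# Two illustrative rows using column names from the dataset above.
df = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01'],
    'Outlet_Establishment_Year': [1999, 2009],
})

indexed = df.set_index('Outlet_Establishment_Year')

print(df.index.tolist())        # the original keeps its default index
print(indexed.index.tolist())   # the new DataFrame is labelled by year

# Label-based selection now uses the year as the row label.
print(indexed.loc[2009, 'Item_Identifier'])

# reset_index() moves the index back into an ordinary column.
restored = indexed.reset_index()
print(restored.columns.tolist())
```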


## Conditional Selection

In [56]:
# checking whether each item is dairy or not
data.Item_Type =='Dairy'

0        True
1       False
2       False
3       False
4       False
        ...  
8518    False
8519    False
8520    False
8521    False
8522    False
Name: Item_Type, Length: 8523, dtype: bool

In [57]:
# the result above can be used inside loc to select the matching rows
data.loc[data.Item_Type =='Dairy']

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
11,FDA03,18.500,Regular,0.045464,Dairy,144.1102,OUT046,1997,Small,Tier 1,Supermarket Type1,2187.1530
19,FDU02,13.350,Low Fat,0.102492,Dairy,230.5352,OUT035,2004,Small,Tier 2,Supermarket Type1,2748.4224
28,FDE51,5.925,Regular,0.161467,Dairy,45.5086,OUT010,1998,,Tier 3,Grocery Store,178.4344
30,FDV38,19.250,Low Fat,0.170349,Dairy,55.7956,OUT010,1998,,Tier 3,Grocery Store,163.7868
...,...,...,...,...,...,...,...,...,...,...,...,...
8424,FDC39,7.405,Low Fat,0.159165,Dairy,207.1296,OUT035,2004,Small,Tier 2,Supermarket Type1,3739.1328
8447,FDS26,20.350,Low Fat,0.089975,Dairy,261.6594,OUT017,2007,,Tier 2,Supermarket Type1,7588.1226
8448,FDV50,14.300,Low Fat,0.123071,Dairy,121.1730,OUT018,2009,Medium,Tier 3,Supermarket Type2,2093.9410
8457,FDY50,5.800,Low Fat,0.130931,Dairy,89.9172,OUT035,2004,Small,Tier 2,Supermarket Type1,1516.6924


In [60]:
# select rows where Item_Type is Dairy and Outlet_Size is Medium
data.loc[(data.Item_Type == 'Dairy') & (data.Outlet_Size == 'Medium')]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
91,DRG27,8.895,Low Fat,0.105274,Dairy,39.9138,OUT049,1999,Medium,Tier 1,Supermarket Type1,690.4346
198,FDE40,,Regular,0.098664,Dairy,62.9194,OUT027,1985,Medium,Tier 3,Supermarket Type3,2105.2596
293,FDH27,7.075,Low Fat,0.058585,Dairy,142.7128,OUT018,2009,Medium,Tier 3,Supermarket Type2,1869.5664
368,FDL51,20.700,Regular,0.047685,Dairy,212.5876,OUT018,2009,Medium,Tier 3,Supermarket Type2,1286.3256
...,...,...,...,...,...,...,...,...,...,...,...,...
8168,FDV38,19.250,Low Fat,0.101932,Dairy,54.5956,OUT049,1999,Medium,Tier 1,Supermarket Type1,764.3384
8182,DRF27,8.930,Low Fat,0.028533,Dairy,151.4340,OUT018,2009,Medium,Tier 3,Supermarket Type2,1225.0720
8280,FDB03,17.750,Regular,0.157471,Dairy,239.1538,OUT018,2009,Medium,Tier 3,Supermarket Type2,4326.3684
8306,FDV50,14.300,Low Fat,0.122762,Dairy,124.3730,OUT049,1999,Medium,Tier 1,Supermarket Type1,2093.9410


In [61]:
# select rows where Item_Type is Dairy, Outlet_Size is Medium and Item_Weight is at least 14
data.loc[(data.Item_Type == 'Dairy') & (data.Outlet_Size == 'Medium') & (data.Item_Weight >= 14)]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
368,FDL51,20.70,Regular,0.047685,Dairy,212.5876,OUT018,2009,Medium,Tier 3,Supermarket Type2,1286.3256
412,FDZ38,17.60,Low Fat,0.008034,Dairy,174.2422,OUT018,2009,Medium,Tier 3,Supermarket Type2,4311.0550
423,FDA27,20.35,Regular,0.000000,Dairy,256.7672,OUT018,2009,Medium,Tier 3,Supermarket Type2,5624.6784
438,FDL51,20.70,Regular,0.047565,Dairy,213.4876,OUT049,1999,Medium,Tier 1,Supermarket Type1,1929.4884
464,DRI51,17.25,Low Fat,0.042414,Dairy,173.1764,OUT018,2009,Medium,Tier 3,Supermarket Type2,4466.1864
...,...,...,...,...,...,...,...,...,...,...,...,...
8159,FDC15,18.10,Low Fat,0.178694,Dairy,158.9288,OUT018,2009,Medium,Tier 3,Supermarket Type2,1571.2880
8168,FDV38,19.25,Low Fat,0.101932,Dairy,54.5956,OUT049,1999,Medium,Tier 1,Supermarket Type1,764.3384
8280,FDB03,17.75,Regular,0.157471,Dairy,239.1538,OUT018,2009,Medium,Tier 3,Supermarket Type2,4326.3684
8306,FDV50,14.30,Low Fat,0.122762,Dairy,124.3730,OUT049,1999,Medium,Tier 1,Supermarket Type1,2093.9410


In [62]:
# select rows where Item_Type is Dairy or Item_Weight is at least 14
data.loc[(data.Item_Type == 'Dairy') | (data.Item_Weight >=14)]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.30,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
2,FDN15,17.50,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,FDX07,19.20,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
8,FDH17,16.20,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.20,Regular,0.094450,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.5350
...,...,...,...,...,...,...,...,...,...,...,...,...
8514,FDA01,15.00,Regular,0.054489,Canned,57.5904,OUT045,2002,,Tier 2,Supermarket Type1,468.7232
8515,FDH24,20.70,Low Fat,0.021518,Baking Goods,157.5288,OUT018,2009,Medium,Tier 3,Supermarket Type2,1571.2880
8516,NCJ19,18.60,Low Fat,0.118661,Others,58.7588,OUT018,2009,Medium,Tier 3,Supermarket Type2,858.8820
8517,FDF53,20.75,reg,0.083607,Frozen Foods,178.8318,OUT046,1997,Small,Tier 1,Supermarket Type1,3608.6360
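Each condition in the cells above is wrapped in parentheses because `&` and `|` bind more tightly than `==` and `>=` in Python. One way to keep longer filters readable is to store each condition in a named boolean mask first. The sketch below does this on a small made-up DataFrame. The mask names are illustrative, and `~` is used to negate a mask.

```python
import pandas as pd

# Four illustrative rows; not the bigmart data.
df = pd.DataFrame({
    'Item_Type': ['Dairy', 'Meat', 'Dairy', 'Canned'],
    'Item_Weight': [9.3, 17.5, 20.7, 5.0],
})

# Named boolean masks, one per condition.
is_dairy = df.Item_Type == 'Dairy'
is_heavy = df.Item_Weight >= 14

both = df.loc[is_dairy & is_heavy]         # dairy AND at least 14
either = df.loc[is_dairy | is_heavy]       # dairy OR at least 14
neither = df.loc[~(is_dairy | is_heavy)]   # rows matching neither condition

print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```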


Pandas comes with a few built-in conditional selectors.

A) isin - selects data whose value is in a list of values

In [64]:
# select rows whose Outlet_Size is Medium or Small
data.loc[data.Outlet_Size.isin(['Medium','Small'])]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
5,FDP36,10.395,Regular,0.000000,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
7,FDP10,,Low Fat,0.127470,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
...,...,...,...,...,...,...,...,...,...,...,...,...
8516,NCJ19,18.600,Low Fat,0.118661,Others,58.7588,OUT018,2009,Medium,Tier 3,Supermarket Type2,858.8820
8517,FDF53,20.750,reg,0.083607,Frozen Foods,178.8318,OUT046,1997,Small,Tier 1,Supermarket Type1,3608.6360
8520,NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976


B) isnull - highlights values which are empty (NaN)

  notnull - highlights values which are not empty (not NaN)

In [65]:
# select rows whose Item_Weight is null
data.loc[data.Item_Weight.isnull()]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
7,FDP10,,Low Fat,0.127470,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
18,DRI11,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.6680
21,FDW12,,Regular,0.035400,Baking Goods,144.5444,OUT027,1985,Medium,Tier 3,Supermarket Type3,4064.0432
23,FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876
29,FDC14,,Regular,0.072222,Canned,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362
...,...,...,...,...,...,...,...,...,...,...,...,...
8485,DRK37,,Low Fat,0.043792,Soft Drinks,189.0530,OUT027,1985,Medium,Tier 3,Supermarket Type3,6261.8490
8487,DRG13,,Low Fat,0.037006,Soft Drinks,164.7526,OUT027,1985,Medium,Tier 3,Supermarket Type3,4111.3150
8488,NCN14,,Low Fat,0.091473,Others,184.6608,OUT027,1985,Medium,Tier 3,Supermarket Type3,2756.4120
8490,FDU44,,Regular,0.102296,Fruits and Vegetables,162.3552,OUT019,1985,Small,Tier 1,Grocery Store,487.3656
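The cell above covers only isnull. As a minimal sketch of its counterpart, the following applies both selectors to a three-row made-up DataFrame and counts the missing values. The variable names are illustrative.

```python
import pandas as pd

# Three illustrative rows, two of them with a missing weight.
df = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'FDP10', 'DRI11'],
    'Item_Weight': [9.3, float('nan'), float('nan')],
})

missing = df.loc[df.Item_Weight.isnull()]    # rows where the weight is NaN
present = df.loc[df.Item_Weight.notnull()]   # rows where the weight is known

# Summing a boolean mask counts its True values.
n_missing = df.Item_Weight.isnull().sum()

print(missing.index.tolist())
print(present.index.tolist())
print(n_missing)
```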
