# Data Wrangling

### Table of contents
* [1.1 Introduction](#1.1_Introduction)
    * [1.2 Purpose](#1.2_Purpose)
    * [1.3 Approach](#1.3_Approach)
* [2.1 Explore the data](#2.1)
    * [2.2 Removing extra data](#2.2)
* [3.1 Summary](#3.1_Summary)

##### 1.1 Introduction<a id='1.1_Introduction'></a>

To stay in business, commercial grocery stores must offer prices that are competitive, offer deals to entice customers, and accurately predict which products, and what quantity of each, to keep in stock. These considerations are complicated by both seasonal and regional trends.

Especially for grocers, the consequences of poor inventory management are dire. Perishable items like fruits and vegetables can rot before selling if they are overstocked. In addition, many locations lack the space or capability to store overstocked, low-demand items. According to Retail Wire, overstocking costs the average retailer 3.2% in lost revenue, while understocking can cost 4.1%. The same source reports that overstocks cost retailers \$123.4 billion every year, and understocks remove another \$129.5 billion from net inflows. [1]

##### 1.2 Purpose<a id='1.2_Purpose'></a>

Using Kaggle data from Favorita grocery stores located in Ecuador [2], we will assess and predict sales of available items using time series analysis. The data is spread across multiple datasets that must be merged: we have information on transactions, stores, regions, holidays, and even oil pricing.

[1] https://www.retailwire.com/discussion/retailers-suffer-the-high-cost-of-overstocks-and-out-of-stocks/

[2] Alexis Cook, DanB, inversion, Ryan Holbrook. (2021). Store Sales - Time Series Forecasting. Kaggle. https://kaggle.com/competitions/store-sales-time-series-forecasting


##### 1.3 Approach<a id='1.3_Approach'></a>

We will use machine learning time series analysis to forecast sales of different types of items across dozens of stores. This will allow Favorita to distribute its resources more efficiently and make it more likely to attract customers to purchase certain products at certain times. The analysis can also inform the company of the best times to offer discounts, whether to stock up on certain items, and general market trends.


##### 2.1 Explore the data<a id='2.1'></a>

In [2]:
# import needed modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [3]:
stores = pd.read_csv('./stores.csv')

In [4]:
stores.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [5]:
stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


In [6]:
# count missing values
missing = pd.concat([stores.isnull().sum(), 100 * stores.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
store_nbr,0,0.0
city,0,0.0
state,0,0.0
type,0,0.0
cluster,0,0.0
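
The same missing-value check is repeated for each dataset in this notebook. A minimal sketch of how it could be wrapped in a helper follows; the name `missing_summary` and the toy dataframe are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    summary = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    summary.columns = ["count", "%"]
    return summary.sort_values(by="count", ascending=False)


# Toy frame standing in for any of the datasets loaded in this notebook.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": ["x", "y", "z", "w"]})
result = missing_summary(demo)
print(result)
```

Calling `missing_summary(stores)` would then be equivalent to the three-line cell above.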


In [7]:
print("{} clusters\n\n{}".format(len(stores.cluster.unique()),stores.cluster.value_counts()))


17 clusters

3     7
6     6
10    6
15    5
13    4
14    4
11    3
4     3
8     3
1     3
9     2
7     2
2     2
12    1
5     1
16    1
17    1
Name: cluster, dtype: int64


In [8]:
print("{} cities\n\n{}".format(len(stores.city.unique()),stores.city.value_counts()))


22 cities

Quito            18
Guayaquil         8
Cuenca            3
Santo Domingo     3
Manta             2
Latacunga         2
Machala           2
Ambato            2
Quevedo           1
Esmeraldas        1
Loja              1
Libertad          1
Playas            1
Daule             1
Babahoyo          1
Salinas           1
Puyo              1
Guaranda          1
Ibarra            1
Riobamba          1
Cayambe           1
El Carmen         1
Name: city, dtype: int64


In [9]:
print("{} types\n\n{}".format(len(stores.type.unique()),stores.type.value_counts()))


5 types

D    18
C    15
A     9
B     8
E     4
Name: type, dtype: int64


In [10]:
print("{} states\n\n{}".format(len(stores.state.unique()),stores.state.value_counts()))

16 states

Pichincha                         19
Guayas                            11
Santo Domingo de los Tsachilas     3
Azuay                              3
Manabi                             3
Cotopaxi                           2
Tungurahua                         2
Los Rios                           2
El Oro                             2
Chimborazo                         1
Imbabura                           1
Bolivar                            1
Pastaza                            1
Santa Elena                        1
Loja                               1
Esmeraldas                         1
Name: state, dtype: int64
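
The per-column counts above could also be collected in a single pass. This is only a sketch: `stores_demo` is a hypothetical four-row stand-in for `stores`, and its values are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the stores table; values are illustrative only.
stores_demo = pd.DataFrame({
    "store_nbr": [1, 2, 3, 4],
    "city": ["Quito", "Quito", "Guayaquil", "Cuenca"],
    "state": ["Pichincha", "Pichincha", "Guayas", "Azuay"],
    "type": ["D", "D", "C", "A"],
})

# Number of distinct values in each categorical column.
unique_counts = stores_demo[["city", "state", "type"]].nunique()
print(unique_counts)
```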


In [11]:
holidays = pd.read_csv('./holidays_events.csv')

In [12]:
holidays.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [13]:
holidays.rename(columns={'type':'holiday_type'},inplace=True)

In [14]:
holidays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          350 non-null    object
 1   holiday_type  350 non-null    object
 2   locale        350 non-null    object
 3   locale_name   350 non-null    object
 4   description   350 non-null    object
 5   transferred   350 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 14.1+ KB


In [15]:
holidays["date"] = pd.to_datetime(holidays["date"])
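
As an alternative to converting after loading, the date column can be parsed while the file is read. The sketch below uses an in-memory CSV (an assumption, so the example runs without the data files) containing two rows copied from the preview above.

```python
import io

import pandas as pd

# In-memory CSV standing in for holidays_events.csv.
csv_text = io.StringIO(
    "date,type,locale,locale_name,description,transferred\n"
    "2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False\n"
    "2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False\n"
)

# parse_dates converts the listed columns to datetime64 during the read.
holidays_demo = pd.read_csv(csv_text, parse_dates=["date"])
print(holidays_demo.dtypes)
```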

In [16]:
# count missing values
missing = pd.concat([holidays.isnull().sum(), 100 * holidays.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
date,0,0.0
holiday_type,0,0.0
locale,0,0.0
locale_name,0,0.0
description,0,0.0
transferred,0,0.0


In [17]:
print("{} holiday types\n\n{}".format(len(holidays.holiday_type.unique()),holidays.holiday_type.value_counts()))

6 holiday types

Holiday       221
Event          56
Additional     51
Transfer       12
Bridge          5
Work Day        5
Name: holiday_type, dtype: int64


In [18]:
print("{} locales\n\n{}".format(len(holidays.locale.unique()),holidays.locale.value_counts()))

3 locales

National    174
Local       152
Regional     24
Name: locale, dtype: int64


In [19]:
print("{} locale names\n\n{}".format(len(holidays.locale_name.unique()),holidays.locale_name.value_counts()))

24 locale names

Ecuador                           174
Quito                              13
Riobamba                           12
Guaranda                           12
Latacunga                          12
Ambato                             12
Guayaquil                          11
Cuenca                              7
Ibarra                              7
Salinas                             6
Loja                                6
Santa Elena                         6
Santo Domingo de los Tsachilas      6
Quevedo                             6
Manta                               6
Esmeraldas                          6
Cotopaxi                            6
El Carmen                           6
Santo Domingo                       6
Machala                             6
Imbabura                            6
Puyo                                6
Libertad                            6
Cayambe                             6
Name: locale_name, dtype: int64


There appears to be a mix of City, State, and Country locale names here.

In [20]:
list(holidays[holidays.locale == "Local"].locale_name.unique())

['Manta',
 'Cuenca',
 'Libertad',
 'Riobamba',
 'Puyo',
 'Guaranda',
 'Latacunga',
 'Machala',
 'Santo Domingo',
 'El Carmen',
 'Cayambe',
 'Esmeraldas',
 'Ambato',
 'Ibarra',
 'Quevedo',
 'Quito',
 'Loja',
 'Salinas',
 'Guayaquil']

"Local" locale refers to a city holiday.

In [21]:
list(holidays[holidays.locale == "Regional"].locale_name.unique())

['Cotopaxi', 'Imbabura', 'Santo Domingo de los Tsachilas', 'Santa Elena']

"Regional" locale refers to a state (province) holiday.

In [22]:
list(holidays[holidays.locale == "National"].locale_name.unique())

['Ecuador']

Logically, all "National" holidays are of the country locale of Ecuador.

Therefore, we can make new dataframes for each type of locale, for use later when merging.

In [23]:
local_holiday = holidays[holidays.locale == "Local"]
regional_holiday = holidays[holidays.locale == "Regional"]
national_holiday = holidays[holidays.locale == "National"]

In [24]:
print("{} descriptions\n\n".format(len(holidays.description.unique())))

103 descriptions




In [25]:
print("{} transfer outcomes\n\n{}".format(len(holidays.transferred.unique()),holidays.transferred.value_counts()))

2 transfer outcomes

False    338
True      12
Name: transferred, dtype: int64


In [26]:
transactions = pd.read_csv('./transactions.csv')

In [27]:
transactions.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [28]:
transactions["date"] = pd.to_datetime(transactions["date"])

In [29]:
transactions.transactions.describe()

count    83488.000000
mean      1694.602158
std        963.286644
min          5.000000
25%       1046.000000
50%       1393.000000
75%       2079.000000
max       8359.000000
Name: transactions, dtype: float64

In [30]:
missing = pd.concat([transactions.isnull().sum(), 100 * transactions.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
date,0,0.0
store_nbr,0,0.0
transactions,0,0.0


In [31]:
print("{} stores\n\n{}".format(len(transactions.store_nbr.unique()),transactions.store_nbr.value_counts()))

54 stores

39    1678
38    1678
26    1678
31    1678
33    1678
34    1678
37    1678
27    1677
28    1677
32    1677
23    1677
40    1677
41    1677
44    1677
45    1677
46    1677
47    1677
48    1677
50    1677
51    1677
49    1677
2     1677
16    1677
5     1677
54    1676
3     1676
4     1676
6     1676
8     1676
9     1676
19    1676
35    1676
13    1676
1     1676
15    1676
11    1676
10    1675
7     1675
17    1674
43    1672
30    1655
14    1638
12    1616
25    1615
24    1577
18    1566
36    1551
53    1167
20     909
29     874
21     748
42     720
22     671
52     118
Name: store_nbr, dtype: int64


In [32]:
oil = pd.read_csv('./oil.csv')

In [33]:
oil.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [34]:
oil.describe()

Unnamed: 0,dcoilwtico
count,1175.0
mean,67.714366
std,25.630476
min,26.19
25%,46.405
50%,53.19
75%,95.66
max,110.62


In [35]:
missing = pd.concat([oil.isnull().sum(), 100 * oil.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
dcoilwtico,43,3.530378
date,0,0.0
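
This section does not say how the 43 missing oil prices are handled. One possible approach, shown here only as a sketch under that assumption, is linear interpolation. The dates and prices in `oil_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative prices only; the leading gap mirrors the first row of oil.csv.
oil_demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08", "2020-01-09"]),
    "dcoilwtico": [np.nan, 50.0, np.nan, 54.0],
})

# Linear interpolation fills interior gaps; limit_direction="both" also
# fills the leading gap with the nearest valid price.
oil_demo["dcoilwtico_filled"] = oil_demo["dcoilwtico"].interpolate(
    method="linear", limit_direction="both"
)
print(oil_demo)
```

Whether interpolation is appropriate depends on the model used later; a forward fill would be another option.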


In [36]:
train_csv = pd.read_csv('./train.csv')

In [37]:
train_csv.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


In [38]:
missing = pd.concat([train_csv.isnull().sum(), 100 * train_csv.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count" ,ascending=False)

Unnamed: 0,count,%
id,0,0.0
date,0,0.0
store_nbr,0,0.0
family,0,0.0
sales,0,0.0
onpromotion,0,0.0


In [39]:
train_csv.dtypes

id               int64
date            object
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object

In [40]:
train_csv["date"] = pd.to_datetime(train_csv["date"])
train_csv.dtypes

id                      int64
date           datetime64[ns]
store_nbr               int64
family                 object
sales                 float64
onpromotion             int64
dtype: object

In [41]:
train_df = train_csv.set_index('date')

In [42]:
train_df

date,id,store_nbr,family,sales,onpromotion
2013-01-01,0,1,AUTOMOTIVE,0.000,0
2013-01-01,1,1,BABY CARE,0.000,0
2013-01-01,2,1,BEAUTY,0.000,0
2013-01-01,3,1,BEVERAGES,0.000,0
2013-01-01,4,1,BOOKS,0.000,0
...,...,...,...,...,...
2017-08-15,3000883,9,POULTRY,438.133,0
2017-08-15,3000884,9,PREPARED FOODS,154.553,1
2017-08-15,3000885,9,PRODUCE,2419.729,148
2017-08-15,3000886,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [43]:
print("{} product families\n\n{}".format(len(train_df.family.unique()),train_df.family.value_counts()))

33 product families

AUTOMOTIVE                    90936
HOME APPLIANCES               90936
SCHOOL AND OFFICE SUPPLIES    90936
PRODUCE                       90936
PREPARED FOODS                90936
POULTRY                       90936
PLAYERS AND ELECTRONICS       90936
PET SUPPLIES                  90936
PERSONAL CARE                 90936
MEATS                         90936
MAGAZINES                     90936
LIQUOR,WINE,BEER              90936
LINGERIE                      90936
LAWN AND GARDEN               90936
LADIESWEAR                    90936
HOME CARE                     90936
HOME AND KITCHEN II           90936
BABY CARE                     90936
HOME AND KITCHEN I            90936
HARDWARE                      90936
GROCERY II                    90936
GROCERY I                     90936
FROZEN FOODS                  90936
EGGS                          90936
DELI                          90936
DAIRY                         90936
CLEANING                      90936
CELEBRA

In [44]:
print("{} distinct onpromotion values\n\n{}".format(len(train_df.onpromotion.unique()),train_df.onpromotion.value_counts()))

362 distinct onpromotion values

0      2389559
1       174551
2        79386
3        45862
4        31659
        ...   
313          1
452          1
642          1
305          1
425          1
Name: onpromotion, Length: 362, dtype: int64


In [45]:
train_df.sales.describe()

count    3.000888e+06
mean     3.577757e+02
std      1.101998e+03
min      0.000000e+00
25%      0.000000e+00
50%      1.100000e+01
75%      1.958473e+02
max      1.247170e+05
Name: sales, dtype: float64

Make a list of unique "family" names.

In [46]:
family_values = train_df.family.unique()

Make a list of all unique store numbers.

In [47]:
store_id = train_df.store_nbr.unique()

Now we can start to combine all of the pertinent columns from the various dataframes.

In [48]:
train_csv

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0
1,1,2013-01-01,1,BABY CARE,0.000,0
2,2,2013-01-01,1,BEAUTY,0.000,0
3,3,2013-01-01,1,BEVERAGES,0.000,0
4,4,2013-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


First, we combine the train_csv sales data with geographical store information.

In [49]:
store_info_1 = train_csv.merge(stores,on=['store_nbr'],how='left')
store_info_1

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13
...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6


Next, we combine the prior dataframe with holiday information; local first.

In [50]:
store_info_2 = store_info_1.merge(local_holiday,how='left',right_on=['locale_name','date'],
                                  left_on=['city','date']).drop(['locale_name','description'], axis=1)

# .drop_duplicates(['date','store_nbr','family']).reset_index()
store_info_2

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type,locale,transferred
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3001147,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,
3001148,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,
3001149,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,
3001150,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,


Unfortunately, there are more rows in the new dataframe than the prior one. We need to investigate and correct this issue.

Since the prior and new dataframes both start and end on the same 'id', there must be dates where the holiday dataframe assigns multiple values.
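
Row inflation of this kind can be caught at merge time. The sketch below uses the `validate` argument of `DataFrame.merge`, which raises `MergeError` when the right-hand keys are not unique. The two toy dataframes are assumptions that mimic a sales row matching two holiday records.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy sales rows; the first one falls on a date with two holiday records.
sales = pd.DataFrame({
    "id": [0, 1],
    "date": pd.to_datetime(["2016-07-24", "2016-07-25"]),
    "city": ["Guayaquil", "Guayaquil"],
})
# Two holiday records share the same (locale_name, date) key.
holidays_demo = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-24", "2016-07-24"]),
    "locale_name": ["Guayaquil", "Guayaquil"],
    "holiday_type": ["Additional", "Transfer"],
})

try:
    sales.merge(
        holidays_demo, how="left",
        left_on=["city", "date"], right_on=["locale_name", "date"],
        validate="many_to_one",  # each sales row may match at most one holiday
    )
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True

print(duplicate_keys_detected)
```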

We make a list of 'id's that appear more than once in the new dataframe.

In [66]:
test = store_info_2.id.value_counts() > 1
test


2312781     True
2312104     True
2312096     True
2312097     True
2312098     True
           ...  
1000298    False
1000299    False
1000300    False
1000301    False
3000887    False
Name: id, Length: 3000888, dtype: bool

Collapsing the dataframe into unique 'id' values and their number of occurrences gives 3,000,888 entries, the original number of rows, so every extra row is a repeat of an existing 'id'.

In [69]:
test2 = pd.DataFrame(store_info_2[(store_info_2.date > '2013-01-01') & (store_info_2.id.value_counts() > 1)])
test2.date.value_counts()

2016-07-24    264
Name: date, dtype: int64

All 264 flagged rows fall on a single date, 2016-07-24, so multiple holidays were assigned to that date.
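
Note that the mask in the cell above combines a row-indexed condition with `value_counts()`, which is indexed by 'id', so pandas aligns the two by label rather than by row. A sketch of a more direct way to pull out every repeated row uses `DataFrame.duplicated`. The toy `merged` dataframe is an assumption standing in for `store_info_2`.

```python
import pandas as pd

# Toy merged frame: id 11 appears twice because two holidays matched it.
merged = pd.DataFrame({
    "id": [10, 11, 11, 12],
    "date": pd.to_datetime(["2016-07-23", "2016-07-24", "2016-07-24", "2016-07-25"]),
    "holiday_type": [None, "Additional", "Transfer", None],
})

# keep=False marks every row whose id occurs more than once.
dup_rows = merged[merged.duplicated(subset="id", keep=False)]
dup_dates = dup_rows["date"].value_counts()
print(dup_rows)
print(dup_dates)
```

With this approach the counts refer to rows, so each repeated 'id' contributes once per matched holiday.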

In [70]:
test2[(test2.date == '2016-07-24') & (test2.id.value_counts() > 1)]



Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type,locale,transferred
2311782,2311782,2016-07-24,24,AUTOMOTIVE,2.000,0,Guayaquil,Guayas,D,1,Additional,Local,False
2311783,2311782,2016-07-24,24,AUTOMOTIVE,2.000,0,Guayaquil,Guayas,D,1,Transfer,Local,False
2311784,2311783,2016-07-24,24,BABY CARE,1.000,0,Guayaquil,Guayas,D,1,Additional,Local,False
2311785,2311783,2016-07-24,24,BABY CARE,1.000,0,Guayaquil,Guayas,D,1,Transfer,Local,False
2311786,2311784,2016-07-24,24,BEAUTY,6.000,1,Guayaquil,Guayas,D,1,Additional,Local,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2312024,2311936,2016-07-24,28,"LIQUOR,WINE,BEER",640.000,0,Guayaquil,Guayas,E,10,Additional,Local,False
2312025,2311936,2016-07-24,28,"LIQUOR,WINE,BEER",640.000,0,Guayaquil,Guayas,E,10,Transfer,Local,False
2312026,2311937,2016-07-24,28,MAGAZINES,2.000,0,Guayaquil,Guayas,E,10,Additional,Local,False
2312027,2311937,2016-07-24,28,MAGAZINES,2.000,0,Guayaquil,Guayas,E,10,Transfer,Local,False


According to documentation,

"A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday...

Additional holidays are days added a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday)."

The issue is that Simón Bolívar Day falls on 07/24 every year, so it cannot be a transfer to that date.

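Therefore, we will eliminate the "Transfer" record behind the duplicates on 2016-07-24. Note that the filter below drops every local holiday row whose type contains "Transfer", not only the one on that date.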

In [71]:
# keep only the local holiday rows whose type is not a "Transfer"
local_holiday2 = local_holiday[~local_holiday['holiday_type'].str.contains("Transfer")]

In [72]:
local_holiday2

Unnamed: 0,date,holiday_type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
5,2012-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
...,...,...,...,...,...,...
338,2017-11-12,Holiday,Local,Ambato,Independencia de Ambato,False
339,2017-12-05,Additional,Local,Quito,Fundacion de Quito-1,False
340,2017-12-06,Holiday,Local,Quito,Fundacion de Quito,True
341,2017-12-08,Holiday,Local,Loja,Fundacion de Loja,False


In [73]:
store_info_2b = store_info_1.merge(local_holiday2,how='left',right_on=['locale_name','date'],
                                  left_on=['city','date']).drop(['locale_name','description'], axis=1)

store_info_2b

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type,locale,transferred
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,


Now we have the correct dataframe size, so we can continue by merging the regional holidays.

In [74]:
store_info_3 = store_info_2b.merge(regional_holiday,how='left',right_on=['locale_name','date'],
                                  left_on=['state','date']).drop(['locale_name','description'], axis=1)

store_info_3

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,,,,
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,,,,
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,,,,
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,,,,
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,,,,
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,,,,
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,,,,
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,,,,
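
The `_x` and `_y` suffixes in the output above come from merging holiday columns that share names with columns already present. A sketch of one way to keep the column names readable is to rename before merging. The name `regional_holiday_type` and the toy dataframes are assumptions for illustration.

```python
import pandas as pd

# Toy frame that already carries a holiday_type column from the local merge.
base = pd.DataFrame({
    "id": [0, 1],
    "date": pd.to_datetime(["2013-04-01", "2013-04-02"]),
    "state": ["Cotopaxi", "Cotopaxi"],
    "holiday_type": [None, None],
})
regional = pd.DataFrame({
    "date": pd.to_datetime(["2013-04-01"]),
    "locale_name": ["Cotopaxi"],
    "holiday_type": ["Holiday"],
})

# Rename before merging so the new column carries its level in the name.
regional_named = regional.rename(columns={"holiday_type": "regional_holiday_type"})
merged = base.merge(
    regional_named, how="left",
    left_on=["state", "date"], right_on=["locale_name", "date"],
).drop(columns="locale_name")
print(merged.columns.tolist())
```

Passing `suffixes=("_local", "_regional")` to `merge` would be another way to get descriptive names.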


In [75]:
store_info_4 = store_info_3.merge(national_holiday,how='left',on=['date']).drop(['locale_name','description'], axis=1)

store_info_4



Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y,holiday_type,locale,transferred
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3008011,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,,,,,,,
3008012,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,,,,,,,
3008013,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,,,,,,,
3008014,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,,,,,,,


We have run into another duplicate-row issue: the merged dataframe is again longer than the original.

In [76]:
test5 = pd.DataFrame(store_info_4[(store_info_4.date > '2012-01-01') & (store_info_4.id.value_counts() > 1)])
test5.date.value_counts()

2014-12-26    1782
2016-04-30    1782
2016-05-05    1782
2016-05-06    1782
Name: date, dtype: int64

In [77]:
test5

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y,holiday_type,locale,transferred
1286604,1286604,2014-12-26,1,AUTOMOTIVE,1.000,0,Quito,Pichincha,D,13,,,,,,,Bridge,National,False
1286605,1286604,2014-12-26,1,AUTOMOTIVE,1.000,0,Quito,Pichincha,D,13,,,,,,,Additional,National,False
1286606,1286605,2014-12-26,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,,,,,Bridge,National,False
1286607,1286605,2014-12-26,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,,,,,Additional,National,False
1286608,1286606,2014-12-26,1,BEAUTY,3.000,0,Quito,Pichincha,D,13,,,,,,,Bridge,National,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2175817,2172253,2016-05-06,9,POULTRY,422.535,18,Quito,Pichincha,B,6,,,,,,,Event,National,False
2175818,2172254,2016-05-06,9,PREPARED FOODS,72.240,1,Quito,Pichincha,B,6,,,,,,,Event,National,False
2175819,2172255,2016-05-06,9,PRODUCE,1100.665,1,Quito,Pichincha,B,6,,,,,,,Event,National,False
2175820,2172256,2016-05-06,9,SCHOOL AND OFFICE SUPPLIES,3.000,0,Quito,Pichincha,B,6,,,,,,,Event,National,False


In [78]:
national_holiday[national_holiday.date == '2016-04-16']

Unnamed: 0,date,holiday_type,locale,locale_name,description,transferred
219,2016-04-16,Event,National,Ecuador,Terremoto Manabi,False


In [79]:
test5[(test5.date == '2014-12-26') & (test5.id.value_counts() > 1)]



Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y,holiday_type,locale,transferred
1286604,1286604,2014-12-26,1,AUTOMOTIVE,1.0,0,Quito,Pichincha,D,13,,,,,,,Bridge,National,False
1286605,1286604,2014-12-26,1,AUTOMOTIVE,1.0,0,Quito,Pichincha,D,13,,,,,,,Additional,National,False
1286606,1286605,2014-12-26,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,,,,,,,Bridge,National,False
1286607,1286605,2014-12-26,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,,,,,,,Additional,National,False
1286608,1286606,2014-12-26,1,BEAUTY,3.0,0,Quito,Pichincha,D,13,,,,,,,Bridge,National,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1287490,1287047,2014-12-26,21,HARDWARE,0.0,0,Santo Domingo,Santo Domingo de los Tsachilas,B,6,,,,,,,Bridge,National,False
1287491,1287047,2014-12-26,21,HARDWARE,0.0,0,Santo Domingo,Santo Domingo de los Tsachilas,B,6,,,,,,,Additional,National,False
1287492,1287048,2014-12-26,21,HOME AND KITCHEN I,0.0,0,Santo Domingo,Santo Domingo de los Tsachilas,B,6,,,,,,,Bridge,National,False
1287493,1287048,2014-12-26,21,HOME AND KITCHEN I,0.0,0,Santo Domingo,Santo Domingo de los Tsachilas,B,6,,,,,,,Additional,National,False


Specifically for 2014-12-26 (the day after Christmas), the dataset categorizes the day as both a "Bridge" and an "Additional" holiday. We will condense this into just an "Additional" day, based on how the data treats 12-26 in other years.

In [80]:
national_holiday2 = national_holiday.drop(national_holiday[(national_holiday['holiday_type'] == "Bridge") & (national_holiday['date'] == '2014-12-26')].index)
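
The cell above removes one specific record by hand. If more dates turn out to carry several records, a rule-based reduction may scale better. The sketch below keeps one record per date according to a priority ranking. The ranking itself, the helper column `_priority`, and the toy dataframe are all hypothetical.

```python
import pandas as pd

# Toy national holiday table with two records on 2014-12-26.
national_demo = pd.DataFrame({
    "date": pd.to_datetime(["2014-12-25", "2014-12-26", "2014-12-26"]),
    "holiday_type": ["Holiday", "Bridge", "Additional"],
})

# Hypothetical ranking: the lowest number wins when a date has several records.
priority = {"Holiday": 0, "Additional": 1, "Bridge": 2}

deduped = (
    national_demo
    .assign(_priority=national_demo["holiday_type"].map(priority))
    .sort_values(["date", "_priority"])
    .drop_duplicates(subset="date", keep="first")
    .drop(columns="_priority")
    .reset_index(drop=True)
)
print(deduped)
```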

In [81]:
national_holiday2

Unnamed: 0,date,holiday_type,locale,locale_name,description,transferred
14,2012-08-10,Holiday,National,Ecuador,Primer Grito de Independencia,False
19,2012-10-09,Holiday,National,Ecuador,Independencia de Guayaquil,True
20,2012-10-12,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
21,2012-11-02,Holiday,National,Ecuador,Dia de Difuntos,False
22,2012-11-03,Holiday,National,Ecuador,Independencia de Cuenca,False
...,...,...,...,...,...,...
345,2017-12-22,Additional,National,Ecuador,Navidad-3,False
346,2017-12-23,Additional,National,Ecuador,Navidad-2,False
347,2017-12-24,Additional,National,Ecuador,Navidad-1,False
348,2017-12-25,Holiday,National,Ecuador,Navidad,False


In [82]:
national_holiday2.date.value_counts() > 1

2012-12-31     True
2016-05-08     True
2016-05-07     True
2012-12-24     True
2016-05-01     True
              ...  
2014-10-10    False
2014-11-02    False
2014-11-03    False
2014-11-28    False
2017-12-26    False
Name: date, Length: 168, dtype: bool

In [83]:
national_holiday2[(national_holiday2.date.value_counts() > 1) & (national_holiday2['date'] > '2011-12-26')]



Unnamed: 0,date,holiday_type,locale,locale_name,description,transferred
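
The empty result above is likely an alignment effect: `value_counts()` is indexed by date while the dataframe is indexed by row number, so the combined mask matches no rows. A sketch of one way to list the records on repeated dates follows. The toy dataframe and its holiday types are illustrative assumptions.

```python
import pandas as pd

# Toy national holiday table; index values mimic a filtered dataframe.
national_demo = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01", "2016-05-01", "2016-05-02"]),
    "holiday_type": ["Holiday", "Event", "Event"],
}, index=[10, 11, 12])

# Dates that occur more than once in the holiday table.
counts = national_demo["date"].value_counts()
repeated_dates = counts[counts > 1].index

# Select rows by membership, which avoids index alignment between masks.
multi = national_demo[national_demo["date"].isin(repeated_dates)]
print(multi)
```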


In [84]:
test6 = pd.DataFrame(store_info_4[(store_info_4.date > '2010-12-26') & (test5.id.value_counts() > 1)])
test6.date.value_counts()

2014-12-26    891
Name: date, dtype: int64

In [85]:
store_info_5 = store_info_3.merge(national_holiday2,how='left',on=['date']).drop(['locale_name','description'], axis=1)

store_info_5



Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y,holiday_type,locale,transferred
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
1,1,2013-01-01,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
2,2,2013-01-01,1,BEAUTY,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
3,3,2013-01-01,1,BEVERAGES,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
4,4,2013-01-01,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,,,,,Holiday,National,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3006229,3000883,2017-08-15,9,POULTRY,438.133,0,Quito,Pichincha,B,6,,,,,,,,,
3006230,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Quito,Pichincha,B,6,,,,,,,,,
3006231,3000885,2017-08-15,9,PRODUCE,2419.729,148,Quito,Pichincha,B,6,,,,,,,,,
3006232,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,Quito,Pichincha,B,6,,,,,,,,,


In [88]:
test5b = test5[test5.date != '2014-12-26']
test5b

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y,holiday_type,locale,transferred
2161566,2159784,2016-04-30,1,AUTOMOTIVE,12.000,0,Quito,Pichincha,D,13,,,,,,,Event,National,False
2161567,2159785,2016-04-30,1,BABY CARE,0.000,0,Quito,Pichincha,D,13,,,,,,,Event,National,False
2161568,2159786,2016-04-30,1,BEAUTY,3.000,0,Quito,Pichincha,D,13,,,,,,,Event,National,False
2161569,2159787,2016-04-30,1,BEVERAGES,2556.000,25,Quito,Pichincha,D,13,,,,,,,Event,National,False
2161570,2159788,2016-04-30,1,BOOKS,0.000,0,Quito,Pichincha,D,13,,,,,,,Event,National,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2175817,2172253,2016-05-06,9,POULTRY,422.535,18,Quito,Pichincha,B,6,,,,,,,Event,National,False
2175818,2172254,2016-05-06,9,PREPARED FOODS,72.240,1,Quito,Pichincha,B,6,,,,,,,Event,National,False
2175819,2172255,2016-05-06,9,PRODUCE,1100.665,1,Quito,Pichincha,B,6,,,,,,,Event,National,False
2175820,2172256,2016-05-06,9,SCHOOL AND OFFICE SUPPLIES,3.000,0,Quito,Pichincha,B,6,,,,,,,Event,National,False


In [100]:
test5b[(test5b.family=="AUTOMOTIVE")&(test5b.cluster==3)].sort_values('id')

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,holiday_type_x,locale_x,transferred_x,holiday_type_y,locale_y,transferred_y,holiday_type,locale,transferred
2161797,2160015,2016-04-30,16,AUTOMOTIVE,6.0,0,Santo Domingo,Santo Domingo de los Tsachilas,C,3,,,,,,,Event,National,False
2162325,2160543,2016-04-30,30,AUTOMOTIVE,1.0,0,Guayaquil,Guayas,C,3,,,,,,,Event,National,False
2162391,2160609,2016-04-30,32,AUTOMOTIVE,6.0,0,Guayaquil,Guayas,C,3,,,,,,,Event,National,False
2162424,2160642,2016-04-30,33,AUTOMOTIVE,6.0,0,Quevedo,Los Rios,C,3,,,,,,,Event,National,False
2162490,2160708,2016-04-30,35,AUTOMOTIVE,8.0,0,Playas,Guayas,C,3,,,,,,,Event,National,False
2162688,2160906,2016-04-30,40,AUTOMOTIVE,4.0,0,Machala,El Oro,C,3,,,,,,,Event,National,False
2163183,2161401,2016-04-30,54,AUTOMOTIVE,8.0,4,El Carmen,Manabi,C,3,,,,,,,Event,National,False
2172489,2168925,2016-05-05,16,AUTOMOTIVE,6.0,0,Santo Domingo,Santo Domingo de los Tsachilas,C,3,,,,,,,Event,National,False
2173017,2169453,2016-05-05,30,AUTOMOTIVE,2.0,0,Guayaquil,Guayas,C,3,,,,,,,Event,National,False
2173083,2169519,2016-05-05,32,AUTOMOTIVE,3.0,0,Guayaquil,Guayas,C,3,,,,,,,Event,National,False


In [97]:
si3 = store_info_3[(store_info_3.id >2000000)&(store_info_3.id < 3000000)]

In [99]:
si3.to_csv('si3.csv')

In [104]:
train_df.id.value_counts().sort_values()

0          1
24         1
22         1
89         1
21         1
          ..
3000881    1
3000882    1
3000883    1
3000885    1
3000887    1
Name: id, Length: 3000888, dtype: int64

In [86]:
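# Columns to exclude (assumes ctg_file was loaded as a DataFrame earlier in the notebook)
cut = ["Date","b","e","LBE","DR","SegFile","A","B","C","D","E","AD","DE","LD","FS","SUSP"]

# Keep every column whose name is not in the cut list
ctg = ctg_file.loc[:, ~ctg_file.columns.isin(cut)]
ctg.head()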

NameError: name 'ctg_file' is not defined

In [None]:
cols = ctg.iloc[:, :21]  # first 21 columns
cols.head()

In [None]:
cat_cols = ctg.iloc[:,[0,-3,-2,-1]]
cat_cols.head()

In [None]:
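# Per-file mean of the numeric columns
ctg_mean = cols.groupby("FileName").mean(numeric_only=True)
# Per-file maximum of the categorical columns (note: this is the max, not the mode, despite the name)
ctg_mode = cat_cols.groupby("FileName").max()
# Combine both summaries into one table with one row per file
ctg_fe = pd.concat([ctg_mean, ctg_mode], axis=1)
ctg_fe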

In [None]:
# Per-file standard deviation of the numeric columns
ctg_std = ctg.groupby("FileName").std(numeric_only=True)

In [None]:
ctg_mean.iloc[0,:]

In [None]:
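# One boxplot per column of the per-file means, titled so each plot is identifiable
for col in ctg_mean.columns:
    plt.boxplot(ctg_mean[col])
    plt.title(col)
    plt.show()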

Our target feature is 'NSP'; another possible target is 'CLASS'.
Both are categorical variables.

In [None]:
# Inspect the class balance of the NSP target
ctg_file['NSP'].value_counts()

In [None]:
ctg_file['CLASS'].value_counts()

We will call the `hist` method to plot a histogram of each numeric feature and visualize its distribution.

In [None]:

ctg_file.hist(figsize=(20,15))
plt.subplots_adjust(hspace=1.0);
# the trailing ';' suppresses the text output of the last expression

In [None]:
# Check how many distinct values the DR column takes and how often each occurs
ctg_file["DR"].value_counts()

In [None]:
# Check how many distinct values the DS column takes and how often each occurs
ctg_file["DS"].value_counts()

We reached out to the original researchers to ask what the remaining undocumented columns signify but received no response. Because we cannot interpret those columns, we will remove them from the dataset.

##### 2.2 Removing extra data<a id='2.2'></a>

In [None]:
# Columns we cannot interpret or do not need for modelling
cut = ["FileName","Date","b","e","LBE","DR","SegFile","A","B","C","D","E","AD","DE","LD","FS","SUSP"]

# Keep every column whose name is not in the cut list
ctg = ctg_file.loc[:, ~ctg_file.columns.isin(cut)]
ctg.head()
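
The column removal above can also be written with `DataFrame.drop`. The following is a minimal sketch on a made-up miniature table, not the real data: `ctg_file_demo`, `cut_demo`, and the `feature_a` column are hypothetical names introduced only for illustration. It suggests that the boolean-mask approach and `drop(columns=..., errors="ignore")` give the same result, including when a listed column is absent.

```python
import pandas as pd

# Hypothetical miniature of ctg_file; "feature_a" and all values are invented.
ctg_file_demo = pd.DataFrame(
    {"Date": ["day-1", "day-2"], "b": [240, 5], "feature_a": [0.5, 0.7], "NSP": ["N", "S"]}
)

# "SUSP" is deliberately missing from the demo table.
cut_demo = ["Date", "b", "SUSP"]

# Approach used in the notebook: keep columns whose names are not in the list.
kept_by_mask = ctg_file_demo.loc[:, ~ctg_file_demo.columns.isin(cut_demo)]

# Alternative: drop by name, ignoring names that are not present.
kept_by_drop = ctg_file_demo.drop(columns=cut_demo, errors="ignore")

print(list(kept_by_mask.columns))
print(kept_by_mask.equals(kept_by_drop))
```

Without `errors="ignore"`, `drop` raises a `KeyError` for a missing name, which can be useful when a misspelled column should not pass silently.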

In [None]:
# double check for missing values
missing = pd.concat([ctg.isnull().sum(), 100 * ctg.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by="count", ascending=False)

In [None]:
# final check of the data we wrangled and cleaned before advancing to next steps (1/3)
ctg.iloc[:,0:7].describe()

In [None]:
# final check of the data we wrangled and cleaned before advancing to next steps (2/3)
# iloc slices exclude the end position, so start at 7 to avoid skipping a column
ctg.iloc[:,7:15].describe()

In [None]:
# final check of the data we wrangled and cleaned before advancing to next steps (3/3)
ctg.iloc[:,15:].describe()

We export the finalized data as a .csv file for future use.

In [None]:
f = r'C:\Users\Joseph Shire\Documents\Springboard Python Data Science\Python Scripts\springboard\Capstone2\Fetal health idea\ctg.csv'
ctg.to_csv(f)
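
One detail of the export is worth a sketch. By default `to_csv` also writes the row index, which reappears as an `Unnamed: 0` column when the file is read back. The example below uses an in-memory buffer and a made-up table (`ctg_demo` and `feature_a` are hypothetical) to compare the default with `index=False`; whether the index should be kept for the real `ctg` data is a judgment call that this sketch does not settle.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned table; column names and values are invented.
ctg_demo = pd.DataFrame({"feature_a": [120, 132], "NSP": ["N", "S"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
ctg_demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
ctg_demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'feature_a', 'NSP']
print(list(reloaded_clean.columns))       # ['feature_a', 'NSP']
```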

##### 3.1 Summary<a id='3.1_Summary'></a>

The data has been loaded, inspected, cleaned, and saved, and is now ready for exploratory data analysis. There are no missing values left, and the only columns still in the dataset are those needed for classification modelling.

The next step (in a separate notebook) is exploratory data analysis. It already appears that we will need dummy encoding (the pandas `get_dummies` function) to transform our categorical NSP target into three 0/1 indicator columns, one for each possible outcome: N, S, or P.
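
As a rough illustration of that encoding step, here is a minimal sketch using a handful of made-up labels. It assumes the NSP outcomes are stored as the letters N, S, and P; if the real column holds numeric codes instead, the generated column names would differ. The variable names `nsp` and `nsp_dummies` are hypothetical.

```python
import pandas as pd

# Hypothetical NSP labels; the real values come from the wrangled ctg data.
nsp = pd.Series(["N", "S", "P", "N", "N"], name="NSP")

# One indicator column per outcome; dtype=int yields 0/1 instead of booleans.
nsp_dummies = pd.get_dummies(nsp, prefix="NSP", dtype=int)

print(nsp_dummies)
```

Each row has exactly one 1 across the three columns, and the columns are ordered alphabetically by label (`NSP_N`, `NSP_P`, `NSP_S`).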