
- **`sales_train.csv`** Rows: 2935849 sales (January 2013 -> October 2015)
  - **date**: date in format dd/mm/yyyy.
  - **date_block_num**: a consecutive month number. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33
  - **shop_id**: unique identifier of a shop
  - **item_id**: unique identifier of a product
  - **item_price**: current price of an item
  - **item_cnt_day**: number of products sold. You are predicting a monthly amount of this measure.
- **`shops.csv`** Rows: 60 shops
  - **shop_id**
  - **shop_name**: name of shop (RUSSIAN 🇷🇺)
- **`items.csv`** Rows: 22170 products
  - **item_id**
  - **item_name**: name of item (RUSSIAN 🇷🇺)
  - **item_category_id**: unique identifier of item category
- **`item_categories.csv`** Rows: 84 product categories
  - **item_category_id**
  - **item_category_name**: name of item category (RUSSIAN 🇷🇺)
- **`test.csv`** Rows: 214200 (Shop, Item) pairs
  - **ID**: an Id that represents a (Shop, Item) tuple within the test set
  - **shop_id**
  - **item_id**
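
Because the target is monthly while `sales_train.csv` is daily, the daily counts have to be summed per month, shop, and item before modelling. A minimal sketch on a tiny made-up frame (the values are illustrative, not taken from the dataset; `item_cnt_month` and the helper `date_block_num` are names chosen here):

```python
import pandas as pd

def date_block_num(year, month):
    """Consecutive month index with January 2013 as 0."""
    return (year - 2013) * 12 + (month - 1)

# Made-up rows carrying the key columns of sales_train.csv.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 1, 0],
    "shop_id": [5, 5, 5, 7],
    "item_id": [100, 100, 100, 200],
    "item_cnt_day": [1.0, 2.0, 1.0, 3.0],
})

# Sum the daily counts within each (month, shop, item) group.
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
print(date_block_num(2015, 10))  # October 2015, the last training month
```

Under the same numbering, the month to predict (November 2015) would be block 34.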


In [72]:
!pip install geopy

Collecting geopy
  Downloading geopy-2.1.0-py3-none-any.whl (112 kB)
Collecting geographiclib<2,>=1.49
  Downloading geographiclib-1.50-py3-none-any.whl (38 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.1.0


In [73]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import missingno as m
import seaborn as sns
from sklearn.ensemble import IsolationForest
from scipy import stats
import matplotlib.pyplot as plt

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

from geopy.geocoders import Nominatim

path = "../../datasets/predict-future-sales/"

train = pd.read_csv(path+"sales_train.csv") # Daily sales  Jan 2013 -> Oct 2015
shops = pd.read_csv(path+"shops-translated.csv")       # Shops    (60)
items = pd.read_csv(path+"items-translated.csv")       # Products  (22170)
oritem = pd.read_csv(path+"items.csv")                 # Products with the original (untranslated) names
cats  = pd.read_csv(path+"item_categories-translated.csv") # Product categories (84)
test  = pd.read_csv(path+"test.csv", index_col="ID") # predict November 2015
sub   = pd.read_csv(path+"sample_submission.csv", index_col="ID")


In [None]:
Exercise 1: Detect repeated shops
There are 4 repeated shops (8 different shop_ids). Look at the shop_name and find the repeated shops. This is important because the duplicated shops will need to be removed later.

Please complete this:

Ids of repeated shop 1: _11___ and _10___
Ids of repeated shop 2: _39___ and _40___
Ids of repeated shop 3: _23___ and _24___
Ids of repeated shop 4: _0__ and __1__

In [4]:
shops

Unnamed: 0,shop_id,shop_name_translated
0,0,"Yakutsk Ordzhonikidze, 56 francs"
1,1,"Yakutsk TC ""Central"" franc"
2,2,"Adygea Shopping Center ""Mega"""
3,3,"Balashikha TRK ""October-Kinomir"""
4,4,"Volzhsky shopping center ""Volga Mall"""
5,5,"Vologda Shopping and Entertainment Center ""Mar..."
6,6,"Voronezh (Plekhanovskaya, 13)"
7,7,"Voronezh TRC ""Maksimir"""
8,8,"Voronezh TRC City-Park ""Grad"""
9,9,Outbound Trade


In [11]:
gshops = shops.groupby(['shop_name_translated', 'shop_id'])['shop_id'].count()

In [12]:
gshops

shop_name_translated                                   shop_id
Adygea Shopping Center "Mega"                          2          1
Balashikha TRK "October-Kinomir"                       3          1
Chekhov SEC "Carnival"                                 56         1
Colosseum "Rio"                                        16         1
Digital warehouse 1C-Online                            55         1
Kaluga TRC "XXI Century"                               15         1
Kazan TC "Behetle"                                     13         1
Kazan TC "ParkHaus" II                                 14         1
Khimki ТЦ "Mega"                                       54         1
Krasnoyarsk Shopping center "June"                     18         1
Krasnoyarsk Shopping center "Vzletka Plaza"            17         1
Kursk TC "Pushkinsky"                                  19         1
Moscow "Sale"                                          20         1
Moscow MTRTS "Afi Mall"                              
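
Grouping by name and id gives every row a count of 1, so it cannot surface near-duplicate names on its own. One option is to score the names pairwise for string similarity and inspect the highest-scoring pairs by hand. A minimal sketch using the standard library's `difflib` (the `fuzzywuzzy` package imported earlier could play a similar role); the helper `similar_pairs`, the 0.8 threshold, and the entry with id 100 are assumptions made for illustration:

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(names_by_id, threshold=0.8):
    """Return (id_a, id_b, score) for every pair of names scoring >= threshold."""
    pairs = []
    for (id_a, name_a), (id_b, name_b) in combinations(sorted(names_by_id.items()), 2):
        score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 2)))
    return pairs

sample = {
    0: "Yakutsk Ordzhonikidze, 56 francs",
    6: "Voronezh (Plekhanovskaya, 13)",
    100: "Yakutsk Ordzhonikidze, 56",  # made-up entry, only to show a near-duplicate
}
print(similar_pairs(sample))
```

On the full table, `names_by_id` could be built with `dict(zip(shops['shop_id'], shops['shop_name_translated']))`. The flagged pairs are only candidates: two distinct shops in the same city can have very similar names.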

Exercise 2: Physical vs Online shop
Create a binary column to determine whether the shop is online or not.

In [28]:
# 'Online|Digital' is a regex that matches either word
online = shops[shops['shop_name_translated'].str.contains('Online|Digital')]

In [67]:
online

Unnamed: 0,shop_id,shop_name_translated
12,12,Online shop Emergency
55,55,Digital warehouse 1C-Online


In [68]:
shops['online'] = shops['shop_name_translated'].str.contains('Online|Digital')
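
A note on `Series.str.contains`: its first argument is treated as a regular expression by default, and its second positional parameter is `case`, not a second pattern. To match any of several keywords, they can be joined with `|`. A minimal sketch on made-up rows that mirror the shop names above (the `keywords` list is an assumption about what marks a shop as online):

```python
import re
import pandas as pd

names = pd.Series([
    "Online shop Emergency",
    "Digital warehouse 1C-Online",
    "Outbound Trade",
    None,  # a missing name, to show the effect of na=False
])

keywords = ["Online", "Digital"]  # assumed markers of an online shop
pattern = "|".join(re.escape(k) for k in keywords)

# na=False maps missing names to False instead of NaN, keeping the column boolean.
is_online = names.str.contains(pattern, case=False, na=False)
print(is_online.tolist())
```

Escaping each keyword with `re.escape` keeps characters such as `.` or `(` in a keyword from being read as regex syntax.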

Exercise 3: Get the cities
Extract the city name of each shop. The city appears as the first word of the shop name.

In [69]:
shops['shop_name_translated'] = shops['shop_name_translated'].astype(str)
shops['shop_name_translated'].dtype

dtype('O')

In [70]:
def get_shop_name(string):
    """Return the first word of the shop name, which is usually the city."""
    return string.split(" ")[0]

shops['City'] = shops['shop_name_translated'].apply(get_shop_name)

In [146]:
shops.loc[shops.shop_id.isin([9,12,55]), 'City'] = np.nan
shops.loc[shops.shop_id.isin([34]), 'City'] = 'Novgorod'
shops.loc[shops.shop_id.isin([33]), 'City'] = 'Mytishchi'
shops.loc[shops.shop_id.isin([39,40,41]), 'City'] = 'Rostov-On-Don'
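
The first-word rule and the manual corrections can also be written as a split followed by a small table of overrides keyed by `shop_id`, which keeps all exceptions in one place. A sketch on a few made-up rows (the `overrides` dict holds only a subset of the corrections, and the name used for shop 39 is a placeholder, not the real shop name):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "shop_id": [0, 9, 39],
    "shop_name_translated": [
        "Yakutsk Ordzhonikidze, 56 francs",
        "Outbound Trade",
        "Rostov placeholder name",  # hypothetical text, not the real shop name
    ],
})

# First whitespace-separated token of each name.
demo["City"] = demo["shop_name_translated"].str.split().str[0]

# Per-shop corrections: NaN where the name holds no city,
# the full city name where the first word alone is not enough.
overrides = {9: np.nan, 39: "Rostov-On-Don"}
for shop_id, city in overrides.items():
    demo.loc[demo["shop_id"] == shop_id, "City"] = city

print(demo[["shop_id", "City"]])
```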

In [147]:
shops

Unnamed: 0,shop_id,shop_name_translated,City,online,Latitude,Longitude
0,0,"Yakutsk Ordzhonikidze, 56 francs",Yakutsk,False,55.750446,37.617494
1,1,"Yakutsk TC ""Central"" franc",Yakutsk,False,55.750446,37.617494
2,2,"Adygea Shopping Center ""Mega""",Adygea,False,55.750446,37.617494
3,3,"Balashikha TRK ""October-Kinomir""",Balashikha,False,55.750446,37.617494
4,4,"Volzhsky shopping center ""Volga Mall""",Volzhsky,False,55.750446,37.617494
5,5,"Vologda Shopping and Entertainment Center ""Mar...",Vologda,False,55.750446,37.617494
6,6,"Voronezh (Plekhanovskaya, 13)",Voronezh,False,55.750446,37.617494
7,7,"Voronezh TRC ""Maksimir""",Voronezh,False,55.750446,37.617494
8,8,"Voronezh TRC City-Park ""Grad""",Voronezh,False,55.750446,37.617494
9,9,Outbound Trade,,False,55.750446,37.617494


Exercise 4: Latitude & Longitude
Get the latitude and longitude of each city. You can use the geopy package.

In [148]:
from geopy.geocoders import Nominatim

GeoLocator = Nominatim(user_agent="bence")

longitude = []
latitude = []
city_list = []

for city in shops['City']:
    # NOTE: missing cities are stored as NaN, not None, so this check never
    # fires and NaN is sent to the geocoder as the query "nan" (see the output).
    if city is None:
        pass
    else:
        print(city)
        # One request per shop; geocode() returns None when nothing is found.
        location = GeoLocator.geocode(query=city)
        print('Latitude = {}, Longitude = {}'.format(location.latitude, location.longitude))
        city_list.append(location)
        latitude.append(location.latitude)
        longitude.append(location.longitude)

Yakutsk
Latitude = 62.027287, Longitude = 129.732086
Yakutsk
Latitude = 62.027287, Longitude = 129.732086
Adygea
Latitude = 44.6939006, Longitude = 40.1520421
Balashikha
Latitude = 55.8036225, Longitude = 37.9646488
Volzhsky
Latitude = 48.782102, Longitude = 44.7779843
Vologda
Latitude = 59.218876, Longitude = 39.893276
Voronezh
Latitude = 51.6605982, Longitude = 39.2005858
Voronezh
Latitude = 51.6605982, Longitude = 39.2005858
Voronezh
Latitude = 51.6605982, Longitude = 39.2005858
nan
Latitude = 46.3144754, Longitude = 11.0480288
Zhukovsky
Latitude = 55.5972801, Longitude = 38.1199863
Zhukovsky
Latitude = 55.5972801, Longitude = 38.1199863
nan
Latitude = 46.3144754, Longitude = 11.0480288
Kazan
Latitude = 55.7823547, Longitude = 49.1242266
Kazan
Latitude = 55.7823547, Longitude = 49.1242266
Kaluga
Latitude = 54.5101087, Longitude = 36.2598115
Colosseum
Latitude = 41.8902614, Longitude = 12.493087103595503
Krasnoyarsk
Latitude = 63.3233807, Longitude = 97.0979974
Krasnoyarsk
Latitude =

In [153]:
# shops['Location'] = pd.DataFrame(city_list)
# The lists hold one entry per shop, in row order
shops['Latitude'] = latitude
shops['Longitude'] = longitude

In [154]:
shops

Unnamed: 0,shop_id,shop_name_translated,City,online,Latitude,Longitude
0,0,"Yakutsk Ordzhonikidze, 56 francs",Yakutsk,False,62.027287,129.732086
1,1,"Yakutsk TC ""Central"" franc",Yakutsk,False,62.027287,129.732086
2,2,"Adygea Shopping Center ""Mega""",Adygea,False,44.693901,40.152042
3,3,"Balashikha TRK ""October-Kinomir""",Balashikha,False,55.803623,37.964649
4,4,"Volzhsky shopping center ""Volga Mall""",Volzhsky,False,48.782102,44.777984
5,5,"Vologda Shopping and Entertainment Center ""Mar...",Vologda,False,59.218876,39.893276
6,6,"Voronezh (Plekhanovskaya, 13)",Voronezh,False,51.660598,39.200586
7,7,"Voronezh TRC ""Maksimir""",Voronezh,False,51.660598,39.200586
8,8,"Voronezh TRC City-Park ""Grad""",Voronezh,False,51.660598,39.200586
9,9,Outbound Trade,,False,46.314475,11.048029
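
The results show two things worth guarding against: shops without a city (such as shop 9) still receive coordinates, because the missing value reaches the geocoder as the literal query `nan`, and every city is looked up once per shop rather than once overall. A sketch of one way to handle both, written against a generic `geocode` callable so that it runs offline; `geocode_cities` and `fake_geocode` are names invented here, and the stub's coordinates for Yakutsk are copied from the output above:

```python
import math
from collections import namedtuple

import pandas as pd

Location = namedtuple("Location", ["latitude", "longitude"])

def geocode_cities(cities, geocode):
    """Return a Latitude/Longitude DataFrame aligned with `cities`.

    Missing cities and failed lookups yield NaN; each distinct city is queried once.
    """
    cache = {}
    for city in cities.dropna().unique():
        cache[city] = geocode(city)  # may be None when nothing is found

    def coords(city):
        loc = cache.get(city) if pd.notna(city) else None
        if loc is None:
            return (math.nan, math.nan)
        return (loc.latitude, loc.longitude)

    return pd.DataFrame(
        [coords(c) for c in cities],
        columns=["Latitude", "Longitude"],
        index=cities.index,
    )

calls = []

def fake_geocode(query):
    """Offline stand-in for a real geocoder; records each query it receives."""
    calls.append(query)
    known = {"Yakutsk": Location(62.027287, 129.732086)}
    return known.get(query)

cities = pd.Series(["Yakutsk", "Yakutsk", float("nan"), "Nowhere"])
result = geocode_cities(cities, fake_geocode)
print(result)
print(calls)
```

With geopy, `geocode` could be `Nominatim(user_agent=...).geocode`, optionally wrapped in `geopy.extra.rate_limiter.RateLimiter` to space out the requests. Ambiguous first words would still need a more specific query or manually entered coordinates: in the printed output, Colosseum resolves to a location in Rome and Krasnoyarsk to a point well north of the city.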
