In [12]:
import pandas as pd 
import datatable as dt

A careful reading of the questions shows that the real exploratory analysis happens while answering them, so this section only takes a first look at the datasets to get a feel for the variables we will work with.
Because the questions were divided among the group members, each dataset will be cleaned in whatever way best suits the question at hand; not dropping all null values up front is therefore a deliberate choice.

### Import data: INSTAGRAM PROFILES

In [2]:
profiles = pd.read_csv(r"D:\DATASET_ADM-HW2\archive\instagram_profiles.csv", delimiter='\t')

### EXPLORATORY ANALYSIS

In [3]:
profiles.shape

(4509586, 11)

In [5]:
for i in profiles.columns:
    print(i)


sid
profile_id
profile_name
firstname_lastname
description
following
followers
n_posts
url
cts
is_business_account


In [6]:
profiles.head(10)

Unnamed: 0,sid,profile_id,profile_name,firstname_lastname,description,following,followers,n_posts,url,cts,is_business_account
0,4184446,4721050000.0,jphillip033,John Pierce,"""Document Everything Always""",250.0,146.0,170.0,,2019-08-01 14:38:55.394 -0400,False
1,4184457,590583000.0,mama_haas,Deanna,Trying to enjoy the simple things in life. Kni...,534.0,1145.0,2878.0,www.etsy.com/shop/MamaHaas69,2019-08-01 14:39:36.526 -0400,False
2,4184460,1074147000.0,yellowlightbreen,Eliza Gray,Maine is for Lovers,469.0,324.0,431.0,elizajanegray.com,2019-08-01 14:39:54.407 -0400,False
3,4184461,1472039000.0,tec1025,Thomas Clark,,,,,,2019-08-01 14:40:06.472 -0400,
4,4184462,3531421000.0,luckyfluff,,,,,,,2019-08-01 14:40:07.806 -0400,
5,4184465,145064200.0,sabahlke,Sarah bahlke,,266.0,192.0,590.0,,2019-08-01 14:40:16.443 -0400,False
6,4184471,2061868000.0,masslivehs,MassLive High School Sports,Your spot for the best Western Mass. high scho...,157.0,4137.0,753.0,bit.ly/2HIysyv,2019-08-01 14:40:40.390 -0400,True
7,4184472,1446651000.0,hvcanes,Hoosac Valley,,,,,,2019-08-01 14:40:52.635 -0400,
8,4184475,1743726000.0,will_jay_k,William Kramer,I’d rather die a big death than live a small life,115.0,183.0,37.0,,2019-08-01 14:40:59.969 -0400,False
9,4184476,5455198000.0,ashley_downing722,Ashley Downing,,,,,,2019-08-01 14:41:12.826 -0400,


#### We have a dataset of about 4.5 million observations: each one represents an Instagram profile.
For each observation we have 11 observed variables:
- sid (sequence ID)
- profile_id 
- profile_name
- firstname_lastname (person's first and last name)
- description (a sort of short biography of the person)
- following
- followers
- n_posts
- url (any link the person wants to share through their profile)
- cts (date and time the profile was last visited)
- is_business_account (if the person has a business account) 

#### What is noticeable (even from just the first 10 rows) is the large number of NaN values in the dataset. The only two variables with no missing values are sid and profile_name.

In [42]:
profiles.isnull().sum()

sid                          0
profile_id               32447
profile_name                 0
firstname_lastname      288465
description            2055996
following              1056815
followers              1056815
n_posts                1056815
url                    3639312
cts                     438488
is_business_account    1064263
dtype: int64
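
The raw counts above are easier to compare as shares of the total number of rows. The sketch below shows one way to do that, and also checks whether `following`, `followers` and `n_posts` are missing on the same rows, which their identical counts (1056815) suggest but do not prove. The frame `toy` and its values are made up for illustration; on the real data one would use `profiles` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `profiles`; the values are invented for illustration.
toy = pd.DataFrame({
    "profile_name": ["a", "b", "c", "d"],
    "following": [250.0, np.nan, 469.0, np.nan],
    "followers": [146.0, np.nan, 324.0, np.nan],
    "n_posts": [170.0, np.nan, 431.0, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# For each row, count how many of the three columns are missing.
count_cols = ["following", "followers", "n_posts"]
n_missing = toy[count_cols].isnull().sum(axis=1)

# True only if every row has either none or all three of them missing.
missing_together = bool(n_missing.isin([0, len(count_cols)]).all())
print(missing_together)
```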

### Import data: INSTAGRAM LOCATIONS

In [7]:
locations = pd.read_csv(r"D:\DATASET_ADM-HW2\archive\instagram_locations.csv", delimiter='\t')

### EXPLORATORY ANALYSIS

In [8]:
locations.shape

(1022658, 23)

In [9]:
for i in locations.columns:
    print(i)

sid
id
name
street
zip
city
region
cd
phone
aj_exact_city_match
aj_exact_country_match
blurb
dir_city_id
dir_city_name
dir_city_slug
dir_country_id
dir_country_name
lat
lng
primary_alias_on_fb
slug
website
cts


In [10]:
locations.head(10)

Unnamed: 0,sid,id,name,street,zip,city,region,cd,phone,aj_exact_city_match,...,dir_city_name,dir_city_slug,dir_country_id,dir_country_name,lat,lng,primary_alias_on_fb,slug,website,cts
0,719981,110296492939207,"Playa de Daimuz - Valencia, España",,,,,,,False,...,,,,,-0.139475,38.974391,daimuzplaya,playa-de-daimuz-valencia-espana,https://es.wikipedia.org/wiki/Daimuz,2019-05-29 01:21:29.987
1,719983,274391278,Nová Vieska,,,Nová Vieska,,SK,,True,...,Kis-Újfalu,kis-ujfalu,SK,Slovakia,18.466667,47.866667,,nova-vieska,,2019-05-29 01:21:38.037
2,719985,148885595789195,Everest Today,Himalayas,977.0,"Kathmandu, Nepal",,NP,,False,...,Pasupati,pasupati,NP,Nepal,85.33015,27.70196,EverestToday,everest-today,,2019-05-29 01:21:46.295
3,719987,263258277,BULAC - Bibliothèque universitaire des langues...,"65, rue des Grands-Moulins",75013.0,"Paris, France",,FR,01 81 69 18 00,False,...,13ème Arrondissement Paris,13eme-arrondissement-paris,FR,France,2.375995,48.82724,BULAC.Paris,bulac-bibliotheque-universitaire-des-langues-e...,www.bulac.fr,2019-05-29 01:21:54.355
4,326443,406147529857708,ABC Cable Networks Group,3800 W Alameda Ave,91505.0,"Burbank, California",,US,(818) 569-7500,False,...,,,,,-118.341864,34.153265,,abc-cable-networks-group,,2019-04-02 15:22:55.703
5,326440,242403516699715,The Lakes at Discovery Bay,,,,,,(925) 308-3883,,...,,,,,-121.621549,37.925412,TheLakesatDiscoveryBay,the-lakes-at-discovery-bay,www.TheLakesatDiscoveryBay.com,2019-04-02 15:22:55.367
6,719988,1651686855080719,"Tampines, Singapore",Tampines,529941.0,Singapore,,SG,,False,...,,,,,103.949729,1.355203,TampinesZingapurA,tampines-singapore,,2019-05-29 01:21:56.635
7,719992,240487083,Sittano’s Bar & Restaurant,"Shop R03 Westfield Penrith, Level 1 / Riley St...",2750.0,"Penrith, New South Wales",,AU,0247224444,False,...,Penrith,penrith,AU,Australia,150.694367,-33.751031,Sittanos,sittanos-bar-restaurant,http://www.sittanos.com.au/,2019-05-29 01:22:12.909
8,719996,750669435108256,วัดท่าซุง อุทัยธานี,3212,61000.0,"Nam Soem, Uthai Thani, Thailand",,TH,0854623871,False,...,,,,,100.073586,15.329776,,,http://www.watthasung.com,2019-05-29 01:22:27.749
9,719998,223283275,Cine Atlas,Hatanpään valtatie 1,33100.0,"Tampere, Finland",,FI,,False,...,,,,,23.766263,61.49569,,cine-atlas,http://www.finnkino.fi/cinemas/tampere_cine_atlas,2019-05-29 01:22:35.936


#### We have a dataset of about 1 million rows: each row describes a location where users have posted content.
The observed variables are:
- sid (sequence ID)
- id
- name (location name)
- street (street address)
- zip (zip code)
- city (city name)
- region (region name)
- cd (country code)
- phone (phone number)
- aj_exact_city_match
- aj_exact_country_match
- blurb (description of the place)
- dir_city_id (instagram internal City ID)
- dir_city_name (city name)
- dir_city_slug (city tag)
- dir_country_id 
- dir_country_name
- lat (latitude)
- lng (longitude)
- primary_alias_on_fb
- slug 
- website (the location's website)
- cts (timestamp when the location was visited)
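
One detail worth checking before using the coordinates: in the `head(10)` output, `lat` contains values such as 103.949729 and 150.694367, which cannot be latitudes, so the two columns look swapped. The sketch below is a quick sanity check on three rows copied from that output. The swap is inferred from these few rows only and has not been verified on the whole dataset; the names `sample`, `impossible_lat` and `fixed` are introduced here for illustration.

```python
import pandas as pd

# Three rows copied from the head() output (name, lat, lng only).
sample = pd.DataFrame({
    "name": ["Cine Atlas", "Tampines, Singapore", "Everest Today"],
    "lat": [23.766263, 103.949729, 85.33015],
    "lng": [61.49569, 1.355203, 27.70196],
})

# A valid latitude lies in [-90, 90]; anything outside cannot be one.
impossible_lat = sample[sample["lat"].abs() > 90]
print(impossible_lat)

# If the columns are indeed swapped, renaming restores the usual meaning.
fixed = sample.rename(columns={"lat": "lng", "lng": "lat"})
print(fixed)
```

Running the same filter on the full `locations` frame would show how many rows are affected.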

In [45]:
locations.isnull().sum()

sid                             0
id                              0
name                            0
street                     306954
zip                        307079
city                        85492
region                    1020898
cd                          83648
phone                      420970
aj_exact_city_match         22148
aj_exact_country_match      22148
blurb                      615953
dir_city_id                526960
dir_city_name              526960
dir_city_slug              527437
dir_country_id             527030
dir_country_name           526960
lat                          6163
lng                          6163
primary_alias_on_fb        597127
slug                        80990
website                    399396
cts                             0
dtype: int64

#### Just as in the profiles dataset, this dataset also contains many null values.
In particular, the region variable is missing for 99.8% of the observations, and the variables holding Instagram's internal directory information (the dir_* columns) are NaN for just over half of them.
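
The percentages can be recomputed from the counts printed by `isnull().sum()` and the row count from `locations.shape`. The sketch below does this for a handful of the columns and keeps those missing in more than half of the rows. The 50% threshold is an arbitrary choice made for illustration, not a cleaning rule used in the analysis.

```python
import pandas as pd

# Row count from locations.shape and a few null counts from the output above.
n_rows = 1022658
null_counts = pd.Series({
    "region": 1020898,
    "blurb": 615953,
    "dir_city_id": 526960,
    "phone": 420970,
    "lat": 6163,
})

# Convert counts to percentages of all rows.
null_pct = null_counts / n_rows * 100

# Keep only columns missing in more than half of the rows (arbitrary threshold).
mostly_missing = null_pct[null_pct > 50].sort_values(ascending=False).round(1)
print(mostly_missing)
```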

### Import data: INSTAGRAM POSTS

In [13]:
posts=dt.fread(r"D:\DATASET_ADM-HW2\archive\instagram_posts.csv", 
               sep="\t", 
               columns={"sid_profile","profile_id","location_id","post_type","numbr_likes","number_comments"}).to_pandas()

In [14]:
posts.shape

(42710197, 6)

In [15]:
for i in posts.columns:
    print(i)

sid_profile
profile_id
location_id
post_type
numbr_likes
number_comments


In [16]:
posts.head(10)

Unnamed: 0,sid_profile,profile_id,location_id,post_type,numbr_likes,number_comments
0,3496776,2237948000.0,1022366000000000.0,2,80.0,0.0
1,-1,5579335000.0,457426800000000.0,1,25.0,1.0
2,-1,313429600.0,457426800000000.0,1,9.0,0.0
3,-1,1837593000.0,457426800000000.0,1,4.0,0.0
4,-1,1131527000.0,457426800000000.0,1,8.0,0.0
5,-1,16262390.0,282618700.0,1,138.0,15.0
6,-1,35673870.0,282618700.0,1,389.0,10.0
7,-1,840873400.0,282618700.0,1,198.0,23.0
8,-1,329994.0,282618700.0,1,127.0,8.0
9,-1,360796500.0,282618700.0,1,154.0,6.0


#### This dataset contains over 42 million rows, each describing a post.
The full file contains the following variables:
- sid  
- sid_profile      
- post_id    
- profile_id  
- location_id
- cts (timestamp when the Post was created)
- post_type (1 - Photo, 2 - Video, 3 - both)
- descriptions
- numbr_likes (number of likes at the moment it was visited)
- number_comments (number of comments at the moment it was visited)


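Note: Due to the large size of the dataset, only six columns (sid_profile, profile_id, location_id, post_type, numbr_likes, number_comments) were imported. In particular, the 'descriptions' column, which contains the descriptions of all registered posts, was excluded, as it is not considered relevant for most of our analysis.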
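
The import above uses `datatable`. Where that library is not available, the same column selection can be sketched with pandas alone through the `usecols` argument of `read_csv`. The example reads a tiny in-memory sample instead of the real file; the sample rows are made up, and its header (including the name of the text column) is an assumption modelled on the variable list above.

```python
import io
import pandas as pd

# Tiny tab-separated sample standing in for instagram_posts.csv.
# Header and rows are hypothetical and only serve the illustration.
sample_tsv = (
    "sid\tsid_profile\tpost_id\tprofile_id\tlocation_id\tcts\t"
    "post_type\tdescription\tnumbr_likes\tnumber_comments\n"
    "1\t3496776\tabc\t2237948000\t1022366000000000\t2017-08-06 20:06:57\t"
    "2\tsome long text\t80\t0\n"
    "2\t-1\tdef\t5579335000\t457426800000000\t2017-03-28 21:00:59\t"
    "1\tmore long text\t25\t1\n"
)

wanted = ["sid_profile", "profile_id", "location_id",
          "post_type", "numbr_likes", "number_comments"]

# usecols makes the parser skip every column not listed, text column included.
posts_small = pd.read_csv(io.StringIO(sample_tsv), sep="\t", usecols=wanted)
print(posts_small.shape)
print(list(posts_small.columns))
```

The selected columns come back in file order, regardless of the order in `wanted`.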