# 1.2 Automate LinkedIn demographic data extraction from XLS files

In the previous notebook, I finished automating the extraction of metric data from monthly reports curated from LinkedIn and stored in XLS files. In this one, I want to use one of the XLS files (I believe `company1_visitors.xls`) to automate demographic data extraction and complete the LinkedIn report.

In [149]:
import os
import pandas as pd

ROOT_DIR = os.path.dirname(os.path.abspath("../../setup.py"))
DATA_DIR = os.path.join(ROOT_DIR, "data/raw/linkedin")

In [226]:
file = f"{DATA_DIR}/jotovent-2020-09_visitors.xls"
assert os.path.exists(file)

## Demographic Data Tables

Each LinkedIn report needs demographic data for three categories: `Location`, `Industry`, and `Job Function`. Each of these tables requires a lookup against a database of E<->J translations to get the Japanese versions of the English terms that LinkedIn returns.

### Location Data Tables

Final `Location` data tables can take on one of two forms. The first is country-level:

|    Country    |  国  | Visitors |
|:-------------:|:----:|:--------:|
| United States | 米国 |       61 |
| China         | 中国 |        9 |
| Taiwan        | 台湾 |        2 |
|               |      |          |
|               |      |          |

The second is area-level:

| Country                  |             国             | Visitors |
|--------------------------|:--------------------------:|:--------:|
| Greater Seattle Area     | シアトルエリア             |       52 |
| Greater Chicago Area     | シカゴエリア               |        4 |
| Greater Los Angeles Area | ロサンゼルスエリア         |        3 |
| San Francisco Bay Area   | サンフランシスコベイエリア |        3 |
| Osaka, Japan             | 大阪                       |        3 |

I'm unsure whether to forgo the area-level table and report only country-level data for each company. Either way, I need to be able to transform area-level data into country-level data, because LinkedIn provides location data by area.

### Industry Data Tables

`Industry` data is more standardized:

|               Industry              |        産　業        | Visitors |
|:-----------------------------------:|:--------------------:|:--------:|
| Marketing and Advertising           | マーケティング・広告 |       32 |
| Electrical/Electronic Manufacturing | 電気／電子製造       |       17 |
| Consumer Electronics                | 家電                 |        9 |
| Machinery                           | 機械                 |        7 |
| Information Technology and Services | ITサービス           |        6 |

### Job Function Data Tables

Likewise, `Job Function` data is quite straightforward:

|     Job Function     |      職　務      | Visitors |
|:--------------------:|:----------------:|:--------:|
| Business Development | 事業開発         |       31 |
| Sales                | 販売             |       10 |
| Engineering          | エンジニアリング |        8 |
| Marketing            | マーケティング   |        5 |
| Administrative       | 行政             |        3 |

I've noticed that in the Google Sheets for these data tables, the cells that hold each value are not aligned across sheets (e.g. the Industry index is in column K on one sheet and column L on another). For now I'll enter the data manually. Later I can either hard-code where the program puts the data, have the code find the right cell, or standardize the spreadsheets. Automating this is risky because I'm not the only one with access to the sheets: if another user changes a sheet without my knowledge, automated code could write data to the wrong place.
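The "have the code find the right cell" option could look something like the sketch below. It assumes the sheet has already been read into a list of rows (how the Google Sheet is fetched is left out), and `find_cell` and `column_letter` are hypothetical helper names, not part of any library. The sample rows are made up to mirror the column K versus column L mismatch.

```python
from typing import Optional, Sequence, Tuple


def find_cell(rows: Sequence[Sequence[object]], label: str) -> Optional[Tuple[int, int]]:
    """Return (row, column) of the first cell whose text equals `label`, or None."""
    target = label.strip().lower()
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if str(value).strip().lower() == target:
                return r, c
    return None


def column_letter(index: int) -> str:
    """Convert a zero-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# Hypothetical sheets: the same header sits in column K on one and column L on the other
sheet_a = [[""] * 10 + ["Industry", "産業", "Visitors"]]
sheet_b = [[""] * 11 + ["Industry", "産業", "Visitors"]]

for sheet in (sheet_a, sheet_b):
    row, col = find_cell(sheet, "Industry")
    print(row, column_letter(col))
```

Searching for the header text on every run costs a scan of the sheet, but it keeps working if someone inserts a column, and a `None` result gives a clear signal to stop instead of writing to the wrong cell.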

## Dictionary

In [249]:
dictionary = pd.read_csv(f"{DATA_DIR}/linkedin_dictionary.csv")

## Job Function

I want to start with the easiest one, so I'll be populating job function tables first.

In [236]:
visitors_job_functions = pd.read_excel(file, sheet_name=[2])

In [237]:
job_function = visitors_job_functions[2].sort_values(by="Total views", ascending=False).iloc[:5].reset_index(drop=True)

In [238]:
jobs = list(job_function["Job function"])

In [239]:
def get_translation(word):
    # Return the Japanese term for an English word.
    # Raises IndexError if the word is not in the dictionary.
    matches = dictionary.loc[dictionary["English"] == word, "Japanese"]
    return matches.iloc[0]

In [243]:
job_function["Japanese"] = [get_translation(job) for job in jobs]
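`get_translation` raises an `IndexError` when a word is not in the dictionary, which stops the whole cell. A possible alternative, sketched here with a small hypothetical stand-in for `linkedin_dictionary.csv`, is to build a lookup `Series` once and use `Series.map`, which leaves missing terms as `NaN` so they can be listed and added to the dictionary. The name `translate_column` and the term "Quality Assurance" are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for linkedin_dictionary.csv; the column names
# "English" and "Japanese" are the ones get_translation relies on.
sample_dictionary = pd.DataFrame({
    "English": ["Support", "Business Development"],
    "Japanese": ["サポート", "事業開発"],
})


def translate_column(words: pd.Series, dictionary: pd.DataFrame) -> pd.Series:
    """Map English terms to Japanese; terms missing from the dictionary become NaN."""
    # Series.map needs a unique index, so keep only the first entry per English term
    lookup = dictionary.drop_duplicates("English").set_index("English")["Japanese"]
    return words.map(lookup)


sample_jobs = pd.Series(["Support", "Business Development", "Quality Assurance"])
translated = translate_column(sample_jobs, sample_dictionary)

# Terms that still need a dictionary entry
missing = sample_jobs[translated.isna()].tolist()
print(missing)
```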

In [244]:
job_function

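|   |      Job function      | Total views |  Japanese  |
|:-:|:----------------------:|:-----------:|:----------:|
| 0 | Support                |          25 | サポート   |
| 1 | Business Development   |          15 | 事業開発   |
| 2 | Sales                  |           8 | 売上高     |
| 3 | Information Technology |           5 | ITサービス |
| 4 | Purchasing             |           4 | 購買業界   |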


## Industry

This one is easy as well, so I'll go ahead and populate it.

In [245]:
visitors_industry = pd.read_excel(file, sheet_name=[4])

In [246]:
# .copy() gives an independent frame, so the Japanese column can be added to it safely
industry = visitors_industry[4].sort_values(by="Total views", ascending=False).iloc[:6].copy()

In [250]:
industries = list(industry["Industry"])
industry["Japanese"] = [get_translation(field) for field in industries]

In [251]:
industry

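|    |               Industry              | Total views |   Japanese   |
|:--:|:-----------------------------------:|:-----------:|:------------:|
| 11 | Plastics                            |          26 | プラスチック |
|  6 | Higher Education                    |          15 | 高等教育     |
|  9 | Design                              |          11 | デザイン     |
|  8 | Information Technology and Services |           6 | ITサービス   |
|  2 | Construction                        |           5 | 建設業       |
|  1 | Food Production                     |           4 | 食材生産     |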
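The job function and industry cells follow the same pattern: read a sheet, sort by `Total views`, and keep the top rows. That step could be pulled into one helper. This is only a sketch: `top_n` is a hypothetical name, and the sample frame is a stand-in for one sheet of the XLS file, using a few values from the output above.

```python
import pandas as pd


def top_n(sheet: pd.DataFrame, n: int = 5, by: str = "Total views") -> pd.DataFrame:
    """Return the n rows with the highest `by` value, re-indexed from 0."""
    return sheet.sort_values(by=by, ascending=False).head(n).reset_index(drop=True)


# Hypothetical stand-in for one sheet of the visitors file
sample_sheet = pd.DataFrame({
    "Industry": ["Food Production", "Construction", "Higher Education", "Plastics"],
    "Total views": [4, 5, 15, 26],
})

top_two = top_n(sample_sheet, n=2)
print(top_two)
```

Unlike the industry cell, this resets the index, so the original row numbers from the sheet are dropped.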


## Location

This one is a bit more difficult, because I'll have to transform the dataset if I want country-level data. But first, let's make sure I can grab area-level data.

### Area-Level Data

In [255]:
area_file = f"{DATA_DIR}/jotovent-2020-09_visitors.xls"

In [256]:
visitors_location_by_area = pd.read_excel(area_file, sheet_name=[1])

In [257]:
visitors_location_by_area[1].sort_values(by="Total views", ascending=False).iloc[:5]

|   |       Location       | Total views |
|:-:|:--------------------:|:-----------:|
| 4 | Greater Seattle Area |          18 |
| 5 | Hawaiian Islands     |           4 |
| 8 | Shanghai City, China |           3 |
| 0 | Greater Chicago Area |           2 |
| 1 | Greater Denver Area  |           2 |


### Country-Level Data

In [223]:
visitors_location_by_country = pd.read_excel(file, sheet_name=[1])

In [224]:
locations = visitors_location_by_country[1].sort_values(by="Total views", ascending=False)

In [225]:
locations

|    |          Location          | Total views |
|:--:|:--------------------------:|:-----------:|
|  0 | Greater Chicago Area       |          27 |
|  4 | Washington D.C. Metro Area |           7 |
|  9 | Pune Area, India           |           7 |
|  8 | Bengaluru Area, India      |           3 |
|  1 | Cincinnati, Ohio Area      |           2 |
|  2 | Houston, Texas Area        |           2 |
|  3 | Greater Seattle Area       |           2 |
|  5 | Sydney, Australia          |           2 |
|  6 | Digras Area, India         |           2 |
| 10 | Shenzhen, Guangdong, China |           2 |


In [187]:
countries = list(locations["Location"])

In [193]:
countries = [country.split(",") for country in countries]

In [200]:
countries = [country[-1].strip() for country in countries]

In [208]:
final_list = []
for country in countries:
    if 'Area' in country:
        final_list.append("United States")
    else:
        final_list.append(country)

In [210]:
locations["Location"] = final_list
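The split, strip, and `'Area'` check above can be folded into one function, which also makes the heuristic easier to test. The sketch below assumes the same rule as the loop (a trailing part containing "Area" means a US metro area). The `overrides` parameter is a hypothetical addition for labels the rule cannot handle: a label with no comma and no "Area", such as "Hawaiian Islands" in the area-level output, would otherwise pass through unchanged.

```python
from typing import Dict, Optional


def to_country(
    location: str,
    overrides: Optional[Dict[str, str]] = None,
    default_country: str = "United States",
) -> str:
    """Guess the country for a LinkedIn location label (heuristic, not exhaustive)."""
    # Explicit overrides win over the heuristic (hypothetical extension)
    if overrides and location in overrides:
        return overrides[location]
    last_part = location.split(",")[-1].strip()
    # Same rule as the loop above: "Greater Chicago Area" or "Ohio Area"
    # is treated as a US metro area
    if "Area" in last_part:
        return default_country
    return last_part


samples = [
    "Greater Chicago Area",
    "Pune Area, India",
    "Cincinnati, Ohio Area",
    "Shenzhen, Guangdong, China",
    "Hawaiian Islands",
]
for label in samples:
    print(label, "->", to_country(label, overrides={"Hawaiian Islands": "United States"}))
```

With the `locations` frame, this could be applied in one step as `locations["Location"].map(to_country)`.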

In [221]:
# Total views per country, highest first
locations.groupby("Location").sum().sort_values(by="Total views", ascending=False)

|    Location   | Total views |
|:-------------:|:-----------:|
| United States |          40 |
| India         |          13 |
| Australia     |           2 |
| China         |           2 |
| Philippines   |           2 |


In [222]:
locations

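|    |    Location   | Total views |
|:--:|:-------------:|:-----------:|
|  0 | United States |          27 |
|  4 | United States |           7 |
|  9 | India         |           7 |
|  8 | India         |           3 |
|  1 | United States |           2 |
|  2 | United States |           2 |
|  3 | United States |           2 |
|  5 | Australia     |           2 |
|  6 | India         |           2 |
| 10 | China         |           2 |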
