# 1.2 Automate LinkedIn demographic data extraction from XLS files

In the previous notebook, I finished automating the extraction of metric data from monthly reports curated from LinkedIn and stored in XLS files. In this one, I want to use one of the XLS files (I believe `company1_visitors.xls`) to automate demographic data extraction and complete the LinkedIn report.

In [14]:
import os
import pandas as pd

ROOT_DIR = os.path.dirname(os.path.abspath("../../setup.py"))
DATA_DIR = os.path.join(ROOT_DIR, "data/raw/linkedin")

file = f"{DATA_DIR}/company1_visitors.xls"

## Demographic Data Tables

Each LinkedIn report needs demographic data from three areas: `Location`, `Industry`, and `Job Function`. Each of these tables requires a lookup against a database of E<->J translations to get the Japanese equivalents of the English terms that LinkedIn returns.
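The translation lookup could look roughly like the sketch below. The translation database itself isn't shown in this notebook, so a plain dict (`EN_TO_JA`) stands in for it here. The helper name `add_japanese_labels` and the `Japanese` column name are likewise assumptions made for illustration. The translations are taken from the `Job Function` table in this section.

```python
import pandas as pd

# Hypothetical stand-in for the E<->J translation database.
EN_TO_JA = {
    "Business Development": "事業開発",
    "Sales": "販売",
    "Engineering": "エンジニアリング",
}


def add_japanese_labels(df, column, translations, ja_column="Japanese"):
    """Return a copy of df with a Japanese label column right after `column`."""
    out = df.copy()
    # Series.map leaves terms without a translation as NaN, so gaps are easy to spot.
    position = out.columns.get_loc(column) + 1
    out.insert(position, ja_column, out[column].map(translations))
    return out


# Illustrative sample only; "Legal" is a made-up row to show a missing translation.
sample = pd.DataFrame(
    {"Job function": ["Business Development", "Sales", "Legal"], "Total views": [31, 10, 1]}
)
labeled = add_japanese_labels(sample, "Job function", EN_TO_JA)
missing = labeled.loc[labeled["Japanese"].isna(), "Job function"].tolist()
```

Collecting the untranslated terms in `missing` gives a quick way to see which English words still need an entry before a report is filled in.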

### Location Data Tables

Final `Location` data tables can take on one of two forms. The first is country-level:

|    Country    |  国  | Visitors |
|:-------------:|:----:|:--------:|
| United States | 米国 |       61 |
| China         | 中国 |        9 |
| Taiwan        | 台湾 |        2 |
|               |      |          |
|               |      |          |

The second is area-level:

| Country                  |             国             | Visitors |
|--------------------------|:--------------------------:|:--------:|
| Greater Seattle Area     | シアトルエリア             |       52 |
| Greater Chicago Area     | シカゴエリア               |        4 |
| Greater Los Angeles Area | ロサンゼルスエリア         |        3 |
| San Francisco Bay Area   | サンフランシスコベイエリア |        3 |
| Osaka, Japan             | 大阪                       |        3 |

I haven't decided whether to forgo the area-level table and report only country-level data for each company. Either way, I need to be able to transform area-level data into country-level data, because LinkedIn provides location data by area.

### Industry Data Tables

`Industry` data is more standardized:

|               Industry              |        産　業        | Visitors |
|:-----------------------------------:|:--------------------:|:--------:|
| Marketing and Advertising           | マーケティング・広告 |       32 |
| Electrical/Electronic Manufacturing | 電気／電子製造       |       17 |
| Consumer Electronics                | 家電                 |        9 |
| Machinery                           | 機械                 |        7 |
| Information Technology and Services | ITサービス           |        6 |

### Job Function Data Tables

Likewise, `Job Function` data is quite straightforward:

|     Job Function     |      職　務      | Visitors |
|:--------------------:|:----------------:|:--------:|
| Business Development | 事業開発         |       31 |
| Sales                | 販売             |       10 |
| Engineering          | エンジニアリング |        8 |
| Marketing            | マーケティング   |        5 |
| Administrative       | 行政             |        3 |

I'm noticing that the Google Sheets holding these data tables don't place values in the same cells from sheet to sheet (e.g. the Industry index is in column K on one sheet and column L on another). For now, I'll enter the data manually. Later I can hard-code the target cells, have the code find the right cell, or standardize the spreadsheets. Any automated approach is risky, though: I'm not the only one with access to the sheets, and if another user changes the layout without my knowledge, the code could write data to the wrong place.
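As a rough sketch of the "have the code find the right cell" option: if a sheet's values are already available as a list of rows, the header label can be searched for instead of assuming a fixed column. How the values get fetched from Google Sheets is left out, and the names `find_cell` and `column_letter` are hypothetical helpers, not part of any library.

```python
def find_cell(grid, label):
    """Return (row, col) of the first cell whose stripped text equals label, else None."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if isinstance(value, str) and value.strip() == label:
                return r, c
    return None


def column_letter(col):
    """Convert a zero-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    col += 1
    while col:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


# Two illustrative layouts: the header sits in column K on one sheet, L on the other.
sheet_a = [[""] * 10 + ["Industry", "産　業", "Visitors"]]
sheet_b = [[""] * 11 + ["Industry", "産　業", "Visitors"]]

col_a = column_letter(find_cell(sheet_a, "Industry")[1])  # "K"
col_b = column_letter(find_cell(sheet_b, "Industry")[1])  # "L"
```

Because `find_cell` returns `None` when the label is missing, the caller can stop rather than write into a sheet whose layout has changed unexpectedly.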

## Job Function

I want to start with the easiest one, so I'll be populating job function tables first.

In [15]:
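# Passing a list to sheet_name returns a dict of DataFrames keyed by zero-indexed sheet position, so the table is at key 2.
visitors_job_functions = pd.read_excel(file, sheet_name=[2])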

In [26]:
visitors_job_functions[2].sort_values(by="Total views", ascending=False).iloc[:5]

|    | Job function         | Total views |
|---:|:---------------------|------------:|
|  1 | Business Development |          31 |
|  8 | Sales                |          10 |
|  3 | Engineering          |           8 |
|  4 | Marketing            |           5 |
|  0 | Administrative       |           3 |


That wasn't too bad!

## Industry

This one is easy as well, so I'll go ahead and populate it.

In [27]:
visitors_industry = pd.read_excel(file, sheet_name=[4])

In [29]:
visitors_industry[4].sort_values(by="Total views", ascending=False).iloc[:5]

|    | Industry                            | Total views |
|---:|:------------------------------------|------------:|
|  5 | Marketing and Advertising           |          32 |
|  7 | Electrical/Electronic Manufacturing |          17 |
|  1 | Consumer Electronics                |           9 |
|  4 | Machinery                           |           7 |
|  6 | Information Technology and Services |           6 |
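The sort-then-slice step used in the `Job Function` and `Industry` cells is identical, so it could be pulled into a small helper. This is a minimal sketch; `top_views` is a hypothetical name, and the sample frame below contains only the five `Industry` rows visible in the output, not the full sheet.

```python
import pandas as pd


def top_views(df, n=5, by="Total views"):
    """Return the n rows with the highest values in `by`, largest first."""
    # A stable sort keeps the ordering of tied rows deterministic.
    return df.sort_values(by=by, ascending=False, kind="stable").head(n)


# Only the rows visible in the output above, in their original index order.
industry = pd.DataFrame(
    {
        "Industry": [
            "Consumer Electronics",
            "Machinery",
            "Marketing and Advertising",
            "Information Technology and Services",
            "Electrical/Electronic Manufacturing",
        ],
        "Total views": [9, 7, 32, 6, 17],
    },
    index=[1, 4, 5, 6, 7],
)

top_three = top_views(industry, n=3)
```

With a helper like this, each demographic table reduces to one `pd.read_excel` call followed by `top_views(...)`.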


## Location

This one is a bit more difficult, because I'll have to transform the dataset if I want country-level data. But first, let's make sure I can grab area-level data.

### Area-Level Data

In [30]:
area_file = f"{DATA_DIR}/company3_visitors.xls"

In [31]:
visitors_location_by_area = pd.read_excel(area_file, sheet_name=[1])

In [35]:
visitors_location_by_area[1].sort_values(by="Total views", ascending=False).iloc[:5]

|    | Location                 | Total views |
|---:|:-------------------------|------------:|
|  6 | Greater Seattle Area     |          52 |
|  0 | Greater Chicago Area     |           4 |
|  1 | Greater Los Angeles Area |           3 |
|  5 | San Francisco Bay Area   |           3 |
| 11 | Osaka, Osaka, Japan      |           3 |


### Country-Level Data

In [37]:
visitors_location_by_country = pd.read_excel(file, sheet_name=[1])

In [39]:
visitors_location_by_country[1]

|    | Location                            | Total views |
|---:|:------------------------------------|------------:|
|  0 | Greater Boston Area                 |           1 |
|  1 | Greater Chicago Area                |          32 |
|  2 | Greater Atlanta Area                |           1 |
|  3 | Greater Seattle Area                |          22 |
|  4 | Greater Minneapolis-St. Paul Area   |           1 |
|  5 | Raleigh-Durham, North Carolina Area |           2 |
|  6 | Springfield, Illinois Area          |           2 |
|  7 | Shenzhen, Guangdong, China          |           9 |
|  8 | New Taipei City, Taiwan             |           2 |
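The output above is still area-level. One possible way to roll it up to countries is sketched below, under a loud assumption: any location ending in "Area" is treated as a US metro area, and anything else is assigned the last comma-separated part of its name. That holds for the rows shown here, but a non-US name ending in "Area" would be misclassified, so treat it as a heuristic rather than a finished solution. The helper names `area_to_country` and `views_by_country` are hypothetical.

```python
import pandas as pd


def area_to_country(location):
    """Guess the country for a LinkedIn location string (heuristic only)."""
    # Assumption: names ending in "Area" (e.g. "Greater Chicago Area") are US metro areas.
    if location.strip().endswith("Area"):
        return "United States"
    # Otherwise use the last comma-separated part: "Shenzhen, Guangdong, China" -> "China".
    return location.rsplit(",", 1)[-1].strip()


def views_by_country(df, location_col="Location", views_col="Total views"):
    """Sum views per guessed country, largest total first."""
    countries = df[location_col].map(area_to_country).rename("Country")
    totals = df[views_col].groupby(countries).sum()
    return totals.sort_values(ascending=False).reset_index()


# The rows from the output above, re-entered by hand so the sketch is self-contained.
locations = pd.DataFrame(
    {
        "Location": [
            "Greater Boston Area",
            "Greater Chicago Area",
            "Greater Atlanta Area",
            "Greater Seattle Area",
            "Greater Minneapolis-St. Paul Area",
            "Raleigh-Durham, North Carolina Area",
            "Springfield, Illinois Area",
            "Shenzhen, Guangdong, China",
            "New Taipei City, Taiwan",
        ],
        "Total views": [1, 32, 1, 22, 1, 2, 2, 9, 2],
    }
)

by_country = views_by_country(locations)
```

For these rows the totals come out to 61 for the United States, 9 for China, and 2 for Taiwan, which matches the country-level table shown earlier under Location Data Tables.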
