# 1. Automate LinkedIn data extraction from XLS files

From project definition:

> I've noticed that much of the data that I require is available through CSV files that can be manually downloaded from each platform. What I want to do for the first iteration of this project is to use Python to extract the data I need from those CSV files and generate tables that mimic the ones I need to populate on Google Sheets. I will download the CSV files manually and enter the data manually, but I won't have to comb through the files myself to find the data I want.

In this notebook, I'm going to automate the extraction of LinkedIn metric and demographic data from XLS files (LinkedIn exports XLS rather than the CSV files mentioned above). The spreadsheets are **standardized**, meaning that they have the same layout for every company, so a single script should be able to generate a table populated with the LinkedIn data for every company passed in via the command line.
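The command-line entry point described above could look something like the following. This is only a minimal sketch: the `parse_companies` helper and the positional `companies` argument are assumptions for illustration, not something the project defines yet.

```python
import argparse


def parse_companies(argv=None):
    """Return the company names given on the command line.

    Hypothetical helper: the argument name `companies` is an assumption.
    Passing argv=None makes argparse read sys.argv[1:].
    """
    parser = argparse.ArgumentParser(
        description="Extract LinkedIn metrics from exported XLS files."
    )
    parser.add_argument(
        "companies",
        nargs="+",  # require at least one company name
        help="company names as they appear at the start of the export file names",
    )
    args = parser.parse_args(argv)
    return args.companies
```

Accepting an explicit `argv` list keeps the function easy to call from a notebook cell, for example `parse_companies(["company1"])`.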

## Data Entry Table

| LinkedIn        | Month |
|-----------------|-------|
|      Posts      |    13 |
|    Followers    |    61 |
|  New followers  |     2 |
|   Impressions   | 3,351 |
|      Clicks     |    31 |
|      Likes      |    30 |
|      Shares     |     0 |
|     Comments    |     0 |
|    Engagement   |    61 |
| Engagement Rate | 1.82% |

`Followers`, `Engagement` and `Engagement Rate` are automatically calculated and won't need to be extracted from the dataset. In other words, the fields that need to be grabbed from the LinkedIn dataset are as follows:

* `Posts`
* `New Followers`
* `Impressions`
* `Clicks`
* `Likes`
* `Shares`
* `Comments`
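The numbers in the table are consistent with `Engagement` being the sum of clicks, likes, shares and comments (31 + 30 + 0 + 0 = 61) and `Engagement Rate` being engagement divided by impressions (61 / 3,351 ≈ 1.82%). That relationship is inferred from the values shown, not taken from the sheet's actual formulas, so treat the following as a sanity check under that assumption. The function names are hypothetical.

```python
def engagement(clicks, likes, shares, comments):
    """Assumed definition: total number of interactions."""
    return clicks + likes + shares + comments


def engagement_rate(engagement_total, impressions):
    """Assumed definition: interactions per impression (0.0 if no impressions)."""
    if impressions == 0:
        return 0.0
    return engagement_total / impressions


# Values copied from the data entry table above.
total = engagement(clicks=31, likes=30, shares=0, comments=0)
rate = engagement_rate(total, impressions=3351)
print(total, f"{rate:.2%}")  # 61 1.82%
```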

Some data may also need to be entered manually into the report itself. This data is closely related to what goes into the data entry sheet, so it's useful to extract it at the same time. This data is as follows:

* `Organic New Followers`
* `Organic Impressions`
* `Organic Clicks`
* `Organic Likes`
* `Organic Shares`
* `Organic Comments`
* `Page Views`

### Updates

Most of the desired metrics appear to be available in a single XLS file generated from the *Updates* report on LinkedIn. I exported daily update data for the desired time period and downloaded the file. The report was saved in the form `<company-name>_updates_1234567891011.xls`, where `1234567891011` stands for a semi-random 13-character string generated at the time of export.
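Because the 13-character suffix is not known in advance, the file has to be found by pattern, not by exact name. A minimal sketch of one way to do that with `pathlib` follows; the `find_reports` helper and its parameters are hypothetical, and it assumes only that the suffix is exactly 13 characters long.

```python
from pathlib import Path


def find_reports(raw_dir, company, report):
    """Return files named <company>_<report>_<13 characters>.xls, sorted by name.

    Hypothetical helper: `report` is the report type in the file name,
    e.g. "updates".
    """
    # "?" matches exactly one character, so this requires a 13-character suffix.
    pattern = f"{company}_{report}_{'?' * 13}.xls"
    return sorted(Path(raw_dir).glob(pattern))
```

With the layout used in this notebook, the call would be something like `find_reports(f"{DATA_DIR}/raw", "company1", "updates")`, which returns an empty list if no export has been downloaded yet.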

Since the correct numbers for this period are already known (see the data entry table above), we can extract the metrics from the XLS file and check them against those values.

In [2]:
import pandas as pd

In [14]:
import os

ROOT_DIR = os.path.dirname(os.path.abspath("../../setup.py"))

DATA_DIR = os.path.join(ROOT_DIR, "data")

In [19]:
os.path.join(ROOT_DIR, "data")

'/home/jayascript/Code/jayascript/monthly_reporting/data'

In [42]:
import pandas as pd

file = f"{DATA_DIR}/raw/company1_updates_0000000000000.xls"

In [43]:
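# sheet_name=[0, 1] returns a dict of DataFrames keyed by sheet index;
# skiprows=1 skips the first row of each sheet before reading the header.
data = pd.read_excel(file, sheet_name=[0, 1], skiprows=1)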

In [77]:
metrics = data[0]  # first sheet: daily metric columns
updates = data[1]  # second sheet: one row per update (post)

Now that both sheets are loaded, we can start populating the desired table:

| LinkedIn        | Month |
|-----------------|-------|
|      Posts      |    13 |
|    Followers    |    61 |
|  New followers  |     2 |
|   Impressions   | 3,351 |
|      Clicks     |    31 |
|      Likes      |    30 |
|      Shares     |     0 |
|     Comments    |     0 |
|    Engagement   |    61 |
| Engagement Rate | 1.82% |

In [125]:
linkedin_data = pd.Series(dtype='int')

In [126]:
linkedin_data

Series([], dtype: int64)

In [127]:
linkedin_data["Posts"] = len(updates)

In [128]:
# The export labels likes as "Reactions"; this is the "Likes" row of the table.
metrics_to_get = ("Impressions", "Clicks", "Reactions", "Shares", "Comments")

In [130]:
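def get_update_metrics(metrics_to_get):
    """Add total and organic sums for each metric to `linkedin_data`.

    Reads the notebook-level `metrics` DataFrame loaded above.
    """
    for metric in metrics_to_get:
        # Each metric has an "(organic)" and a "(total)" daily column.
        organic = metrics[f"{metric} (organic)"].sum()
        total = metrics[f"{metric} (total)"].sum()

        linkedin_data[metric] = total
        linkedin_data[f"Organic {metric}"] = organic

        print(organic, total)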

In [131]:
get_update_metrics(metrics_to_get)

697 3351
25 31
29 30
0 0
0 0


In [132]:
linkedin_data

Posts                    13
Impressions            3351
Organic Impressions     697
Clicks                   31
Organic Clicks           25
Reactions                30
Organic Reactions        29
Shares                    0
Organic Shares            0
Comments                  0
Organic Comments          0
dtype: int64
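`get_update_metrics` reads the notebook-level `metrics` frame and writes into the shared `linkedin_data` Series, which may get awkward once several companies are processed in one run. One possible refactor returns a new Series instead. This is a sketch only: `summarize_update_metrics` is a hypothetical name, and the tiny frame below uses made-up numbers in place of a real export.

```python
import pandas as pd


def summarize_update_metrics(metrics, names):
    """Sum the "(total)" and "(organic)" columns for each metric name.

    Returns a new Series and does not touch any global state.
    """
    totals = {}
    for name in names:
        totals[name] = metrics[f"{name} (total)"].sum()
        totals[f"Organic {name}"] = metrics[f"{name} (organic)"].sum()
    return pd.Series(totals, dtype="int64")


# Made-up numbers for illustration only; not taken from a LinkedIn export.
sample = pd.DataFrame({
    "Clicks (organic)": [1, 2],
    "Clicks (total)": [3, 4],
})
result = summarize_update_metrics(sample, ["Clicks"])
print(result)
```

Returning a value makes it straightforward to build one Series per company and combine them afterwards.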

### Followers

Two of the remaining desired metrics, `New Followers` and `Organic New Followers`, appear to be available in a single XLS file generated from the *Followers* report on LinkedIn. I exported all follower data for the desired time period and downloaded the file. The report was saved in the form `<company-name>_followers_1234567891011.xls`, where `1234567891011` stands for a semi-random 13-character string generated at the time of export.

In [133]:
file = f"{DATA_DIR}/raw/company1_followers_0000000000000.xls"
# sheet_name=[0] returns a dict, so the DataFrame itself is new_followers[0].
new_followers = pd.read_excel(file, sheet_name=[0])

In [137]:
def get_follower_metrics(new_followers):
    """Add total and organic new-follower sums to `linkedin_data`."""
    organic = new_followers["Organic followers"].sum()
    total = new_followers["Total followers"].sum()

    linkedin_data["New Followers"] = total
    linkedin_data["Organic New Followers"] = organic

    print(organic, total)

In [138]:
get_follower_metrics(new_followers[0])

2 2


In [139]:
linkedin_data

Posts                      13
Impressions              3351
Organic Impressions       697
Clicks                     31
Organic Clicks             25
Reactions                  30
Organic Reactions          29
Shares                      0
Organic Shares              0
Comments                    0
Organic Comments            0
New Followers               2
Organic New Followers       2
dtype: int64

### Page Views

The last desired metric for the LinkedIn report, `Page Views`, is available in a single XLS file generated from the *Visitors* report on LinkedIn. I exported all visitor data for the desired time period and downloaded the file. The report was saved in the form `<company-name>_visitors_1234567891011.xls`, where `1234567891011` stands for a semi-random 13-character string generated at the time of export.

In [140]:
file = f"{DATA_DIR}/raw/company1_visitors_0000000000000.xls"
visitor_metrics = pd.read_excel(file, sheet_name=[0])

In [141]:
visitor_metrics[0]

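|   | Date | Overview page views (desktop) | Overview page views (mobile) | Overview page views (total) | Overview unique visitors (desktop) | Overview unique visitors (mobile) | Overview unique visitors (total) | Life page views (desktop) | Life page views (mobile) | Life page views (total) | ... | Jobs page views (total) | Jobs unique visitors (desktop) | Jobs unique visitors (mobile) | Jobs unique visitors (total) | Total page views (desktop) | Total page views (mobile) | Total page views (total) | Total unique visitors (desktop) | Total unique visitors (mobile) | Total unique visitors (total) |
|---|------|---|---|---|---|---|---|---|---|---|-----|---|---|---|---|---|---|---|---|---|---|
| 0 | 07/01/2020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 07/02/2020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 07/03/2020 | 9 | 1 | 10 | 1 | 1 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 9 | 1 | 10 | 1 | 1 | 2 |
| 3 | 07/04/2020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 07/05/2020 | 0 | 2 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 1 | 1 |
| 5 | 07/06/2020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 07/07/2020 | 6 | 0 | 6 | 3 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 6 | 0 | 6 | 3 | 0 | 3 |
| 7 | 07/08/2020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 07/09/2020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 07/10/2020 | 4 | 0 | 4 | 2 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 4 | 0 | 4 | 2 | 0 | 2 |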


In [145]:
linkedin_data["Page views"] = sum(visitor_metrics[0]["Total page views (total)"])

In [146]:
linkedin_data

Posts                      13
Impressions              3351
Organic Impressions       697
Clicks                     31
Organic Clicks             25
Reactions                  30
Organic Reactions          29
Shares                      0
Organic Shares              0
Comments                    0
Organic Comments            0
New Followers               2
Organic New Followers       2
Page views                 82
dtype: int64
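To mimic the data entry table, the extracted values still need to be put in the sheet's row order, with `Reactions` shown as `Likes` and thousands separators added. Here is a minimal sketch of that step, assuming a plain dict such as `linkedin_data.to_dict()` as input. The `ROWS` mapping and the `to_entry_rows` function are hypothetical names, and the calculated rows (`Followers`, `Engagement`, `Engagement Rate`) are left out because the sheet fills them in.

```python
# (label in the data entry table, key in the extracted data)
ROWS = [
    ("Posts", "Posts"),
    ("New followers", "New Followers"),
    ("Impressions", "Impressions"),
    ("Clicks", "Clicks"),
    ("Likes", "Reactions"),
    ("Shares", "Shares"),
    ("Comments", "Comments"),
]


def to_entry_rows(values):
    """Return (label, formatted value) pairs in data-entry order."""
    return [(label, f"{int(values[key]):,}") for label, key in ROWS]


# Values copied from the linkedin_data output above.
extracted = {
    "Posts": 13, "New Followers": 2, "Impressions": 3351, "Clicks": 31,
    "Reactions": 30, "Shares": 0, "Comments": 0,
}
for label, value in to_entry_rows(extracted):
    print(f"| {label:<15} | {value:>5} |")
```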