# 1. Automate data extraction from CSV files.

From project definition:

> I've noticed that much of the data that I require is available through CSV files that can be manually downloaded from each platform. What I want to do for the first iteration of this project is to use Python to extract the data I need from those CSV files and generate tables that mimic the ones I need to populate on Google Sheets. I will download the CSV files manually and enter the data manually, but I won't have to comb through the files myself to find the data I want.

## Data Entry Tables

The following five platforms are used each month for analysis:

1. LinkedIn
2. Twitter
3. Facebook
4. Google Analytics
5. Google Ads

All of the companies use platforms 1, 4 and 5, whereas only one company uses 2 and 3. The data entry spreadsheets are **standardized**, meaning they look exactly the same for every company no matter which platforms it uses. This makes the data entry spreadsheet the easiest part of the extraction process to start with.

In the next few cells, we'll take a look at what the master tables look like and where the data is stored in the CSV files.

### LinkedIn

| LinkedIn        | Month |
|-----------------|-------|
|      Posts      |    13 |
|    Followers    |    61 |
|  New Followers  |     2 |
|   Impressions   | 3,351 |
|      Clicks     |    31 |
|      Likes      |    30 |
|      Shares     |     0 |
|     Comments    |     0 |
|    Engagement   |    61 |
| Engagement Rate | 1.82% |

`Followers`, `Engagement` and `Engagement Rate` are automatically calculated and won't need to be extracted from the dataset. In other words, the fields that need to be grabbed from the LinkedIn dataset are as follows:

* `Posts`
* `New Followers`
* `Impressions`
* `Clicks`
* `Likes`
* `Shares`
* `Comments`
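
The three calculated fields can be sanity-checked against the sample table. The sketch below assumes that `Engagement` is the sum of clicks, likes, shares and comments, and that `Engagement Rate` is engagement divided by impressions. Both formulas are inferred from the numbers above (31 + 30 + 0 + 0 = 61 and 61 / 3,351 ≈ 1.82%), not taken from the spreadsheet itself, and the function name `derive_engagement` is hypothetical.

```python
def derive_engagement(impressions, clicks, likes, shares, comments):
    """Return (engagement, engagement_rate) under the assumed formulas."""
    # ASSUMPTION: engagement is the sum of the four interaction counts.
    engagement = clicks + likes + shares + comments
    # ASSUMPTION: the rate is engagement per impression; guard against zero impressions.
    rate = engagement / impressions if impressions else 0.0
    return engagement, rate


# Values from the sample LinkedIn table above.
engagement, rate = derive_engagement(
    impressions=3351, clicks=31, likes=30, shares=0, comments=0
)
print(engagement, f"{rate:.2%}")
```

If the extracted fields ever produce an engagement figure that disagrees with the sheet, a check like this would help show whether the extraction or the assumed formula is at fault.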

Some data may also need to be entered manually into the report itself. This data is closely related to what goes into the data entry sheet, so it would be useful to have it ready at the same time. It is as follows:

* `Organic New Followers`
* `Organic Impressions`
* `Organic Clicks`
* `Organic Likes`
* `Organic Shares`
* `Organic Comments`
* `Page Views`

#### Updates

Most of the desired metrics appear to be available in a single XLS file generated from the *Updates* report on LinkedIn. I exported daily update data for the desired time period and downloaded the file. The report was saved as `<company-name>_updates_1234567891011.xls`, where `1234567891011` stands in for a semi-random 13-digit string generated at the time of export.
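
Because the export name follows a fixed pattern, a script could recognize these files and pull out the company name without any manual renaming. The following is a minimal sketch of that idea. The names `UPDATES_FILENAME`, `parse_updates_filename` and `suffix_as_datetime` are hypothetical. Reading the suffix as a Unix timestamp in milliseconds is only a guess, not something LinkedIn's export is confirmed to do here.

```python
import re
from datetime import datetime, timezone

# Matches "<company-name>_updates_<13 digits>.xls", the naming scheme described above.
UPDATES_FILENAME = re.compile(r"^(?P<company>.+)_updates_(?P<suffix>\d{13})\.xls$")


def parse_updates_filename(filename):
    """Return (company, suffix) for an Updates export, or None if the name doesn't match."""
    match = UPDATES_FILENAME.match(filename)
    if match is None:
        return None
    return match.group("company"), match.group("suffix")


def suffix_as_datetime(suffix):
    """ASSUMPTION: treat the suffix as milliseconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(suffix) / 1000, tz=timezone.utc)


# "company" is a placeholder; the suffix is the placeholder from the text.
print(parse_updates_filename("company_updates_1234567891011.xls"))
```

Returning `None` for non-matching names lets the caller skip unrelated files in the download folder instead of failing on them.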

Since we know the correct numbers, we can now try to extract the desired metrics from the XLS file. Here are the correct numbers again:

| LinkedIn        | Month |
|-----------------|-------|
|      Posts      |    13 |
|    Followers    |    61 |
|  New Followers  |     2 |
|   Impressions   | 3,351 |
|      Clicks     |    31 |
|      Likes      |    30 |
|      Shares     |     0 |
|     Comments    |     0 |
|    Engagement   |    61 |
| Engagement Rate | 1.82% |

In [2]:
import pandas as pd

In [14]:
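import os

# Resolve the project root relative to the notebook's working directory
# (setup.py is expected two levels up).
ROOT_DIR = os.path.dirname(os.path.abspath("../../setup.py"))

# Data files live under <project root>/data.
DATA_DIR = os.path.join(ROOT_DIR, "data")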

In [19]:
os.path.join(ROOT_DIR, "data")

'/home/jayascript/Code/jayascript/monthly_reporting/data'

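In [22]:
# Path to the LinkedIn Updates export for this reporting period.
# pandas was already imported above, so the import is not repeated here.
file = os.path.join(DATA_DIR, "raw", "nichifu_updates_1597171420378.xls")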

In [39]:
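# Read the first two sheets, skipping the first row of each so the next row becomes the header.
# Passing a list to sheet_name makes read_excel return a dict of DataFrames keyed by sheet index.
data = pd.read_excel(file, sheet_name=[0, 1], skiprows=1)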

In [40]:
metrics = data[0]
updates = data[1]

In [41]:
metrics

Unnamed: 0,Date,Impressions (organic),Impressions (sponsored),Impressions (total),Unique impressions (organic),Clicks (organic),Clicks (sponsored),Clicks (total),Reactions (organic),Reactions (sponsored),Reactions (total),Comments (organic),Comments (sponsored),Comments (total),Shares (organic),Shares (sponsored),Shares (total),Engagement rate (organic),Engagement rate (sponsored),Engagement rate (total)
0,07/01/2020,12,0,12,5,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
1,07/02/2020,29,0,29,7,0,0,0,2,0,2,0,0,0,0,0,0,0.068966,0.0,0.068966
2,07/03/2020,46,0,46,7,0,0,0,2,0,2,0,0,0,0,0,0,0.043478,0.0,0.043478
3,07/04/2020,13,0,13,8,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
4,07/05/2020,4,0,4,3,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
5,07/06/2020,10,0,10,6,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
6,07/07/2020,32,0,32,5,2,0,2,3,0,3,0,0,0,0,0,0,0.15625,0.0,0.15625
7,07/08/2020,9,0,9,6,0,0,0,1,0,1,0,0,0,0,0,0,0.111111,0.0,0.111111
8,07/09/2020,5,0,5,5,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
9,07/10/2020,19,0,19,13,0,0,0,2,0,2,0,0,0,0,0,0,0.105263,0.0,0.105263
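
Since each row of `metrics` is one day, the monthly figures for the data entry table should come from summing the relevant columns. The sketch below tries this on the ten rows shown above, rebuilt by hand with only the columns it needs, so its totals cover those ten days and will not match the full-month table. The mapping of the sheet's `Likes` field to the `Reactions` columns is an assumption, and `FIELD_TO_COLUMN` and `monthly_totals` are hypothetical names.

```python
import pandas as pd

# Daily values copied from the ten rows displayed above.
impressions = [12, 29, 46, 13, 4, 10, 32, 9, 5, 19]
clicks = [0, 0, 0, 0, 0, 0, 2, 0, 0, 0]
reactions = [0, 2, 2, 0, 0, 0, 3, 1, 0, 2]
zeros = [0] * 10

# Sponsored values are all zero in these rows, so organic equals total.
sample = pd.DataFrame(
    {
        "Impressions (organic)": impressions, "Impressions (total)": impressions,
        "Clicks (organic)": clicks, "Clicks (total)": clicks,
        "Reactions (organic)": reactions, "Reactions (total)": reactions,
        "Comments (organic)": zeros, "Comments (total)": zeros,
        "Shares (organic)": zeros, "Shares (total)": zeros,
    }
)

# Data entry field -> column prefix in the export.
FIELD_TO_COLUMN = {
    "Impressions": "Impressions",
    "Clicks": "Clicks",
    "Likes": "Reactions",  # ASSUMPTION: "Likes" in the sheet corresponds to "Reactions" here
    "Shares": "Shares",
    "Comments": "Comments",
}


def monthly_totals(metrics, kind="total"):
    """Sum each daily column of the given kind ("total" or "organic") over all rows."""
    return {
        field: int(metrics[f"{column} ({kind})"].sum())
        for field, column in FIELD_TO_COLUMN.items()
    }


print(monthly_totals(sample))
print(monthly_totals(sample, kind="organic"))
```

Calling `monthly_totals(metrics)` on the full DataFrame would give candidates for `Impressions`, `Clicks`, `Likes`, `Shares` and `Comments`, and `kind="organic"` would give the matching organic figures for the report. `Posts`, `New Followers` and `Page Views` do not appear among these columns and would have to come from elsewhere.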
