## Data Cleaning: Google Trends Data

This notebook walks through the steps to clean the search-interest data scraped from Google Trends^.

^Trends data were scraped from https://trends.google.com  
See the notebook `data_analysis/data_cleaning/trends_dl.ipynb` for more details.

In [1]:
import pandas as pd

### Load the data

In [2]:
# import csv data
trends_df = pd.read_csv('../../data/google_trends/trends_monthly.csv')
trends_df

Unnamed: 0,date,dengue,dengue fever,bone pain,rain,mosquito bite,fever,rashes,rash,mosquito
0,2022-01-02,11,0,0,87,30,54,26,66,39
1,2022-01-09,12,21,58,34,19,59,86,67,40
2,2022-01-16,16,11,0,43,19,64,75,70,58
3,2022-01-23,16,0,19,32,45,66,55,80,46
4,2022-01-30,13,0,44,53,56,64,76,76,44
...,...,...,...,...,...,...,...,...,...,...
569,2012-11-25,60,63,45,36,0,89,42,71,84
570,2012-12-02,32,33,0,49,0,94,49,65,67
571,2012-12-09,59,60,87,47,0,86,90,76,50
572,2012-12-16,49,63,36,57,51,86,55,62,61


### Add the year and week

The dengue data are reported by epidemiological week, so we standardize all our datasets to the same epidemiological weeks, each of which starts on Sunday and ends on Saturday (7 days).

Since Google Trends returns one data point every 7 days when scraped year by year, we can number the weeks by counting rows within each year (i.e. within a year, row 0 = week 1, row 1 = week 2, and so on). This relies on each year's rows being in ascending date order.

In [3]:
# year is simply the first 4 characters of the `date` column (format YYYY-MM-DD)
trends_df['year'] = trends_df['date'].str[:4].astype('int')

# number the rows within each year, starting from 1
# (assumes the rows of each year are already in ascending date order)
trends_df['week'] = trends_df.groupby('year').cumcount() + 1

trends_df

Unnamed: 0,date,dengue,dengue fever,bone pain,rain,mosquito bite,fever,rashes,rash,mosquito,year,week
0,2022-01-02,11,0,0,87,30,54,26,66,39,2022,1
1,2022-01-09,12,21,58,34,19,59,86,67,40,2022,2
2,2022-01-16,16,11,0,43,19,64,75,70,58,2022,3
3,2022-01-23,16,0,19,32,45,66,55,80,46,2022,4
4,2022-01-30,13,0,44,53,56,64,76,76,44,2022,5
...,...,...,...,...,...,...,...,...,...,...,...,...
569,2012-11-25,60,63,45,36,0,89,42,71,84,2012,49
570,2012-12-02,32,33,0,49,0,94,49,65,67,2012,50
571,2012-12-09,59,60,87,47,0,86,90,76,50,2012,51
572,2012-12-16,49,63,36,57,51,86,55,62,61,2012,52
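
The row-counting approach above only gives meaningful week numbers if, within each year, the dates are in ascending order and exactly 7 days apart. Below is a minimal sketch of how that assumption could be checked. The `sample` frame and the `check_weekly_spacing` helper are hypothetical and only for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the layout of the `date` column
sample = pd.DataFrame({
    'date': ['2022-01-02', '2022-01-09', '2022-01-16',
             '2021-01-03', '2021-01-10', '2021-01-17'],
})

def check_weekly_spacing(df, date_col='date'):
    """Return True if, within each year, consecutive dates are exactly 7 days apart."""
    dates = pd.to_datetime(df[date_col])
    # gap to the previous row of the same year; the first row of each year is NaT
    gaps = dates.groupby(dates.dt.year).diff().dropna()
    return bool((gaps == pd.Timedelta(days=7)).all())

# same week-numbering logic as in the cell above
sample['year'] = pd.to_datetime(sample['date']).dt.year
sample['week'] = sample.groupby('year').cumcount() + 1

spacing_ok = check_weekly_spacing(sample)
# dayofweek uses Monday = 0, so Sunday = 6
all_sundays = bool((pd.to_datetime(sample['date']).dt.dayofweek == 6).all())
```

If `spacing_ok` or `all_sundays` came back `False` on the real data, the week numbers would need to be derived from the dates rather than from the row position.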


### Rename the columns

Standardize the column names so that they contain no spaces and are succinct and lowercase.

In [4]:
trends_df = trends_df.rename(columns={'dengue fever': 'dengue_fever',
                                      'bone pain': 'bone_pain',
                                      'mosquito bite': 'mosquito_bite'})
trends_df

Unnamed: 0,date,dengue,dengue_fever,bone_pain,rain,mosquito_bite,fever,rashes,rash,mosquito,year,week
0,2022-01-02,11,0,0,87,30,54,26,66,39,2022,1
1,2022-01-09,12,21,58,34,19,59,86,67,40,2022,2
2,2022-01-16,16,11,0,43,19,64,75,70,58,2022,3
3,2022-01-23,16,0,19,32,45,66,55,80,46,2022,4
4,2022-01-30,13,0,44,53,56,64,76,76,44,2022,5
...,...,...,...,...,...,...,...,...,...,...,...,...
569,2012-11-25,60,63,45,36,0,89,42,71,84,2012,49
570,2012-12-02,32,33,0,49,0,94,49,65,67,2012,50
571,2012-12-09,59,60,87,47,0,86,90,76,50,2012,51
572,2012-12-16,49,63,36,57,51,86,55,62,61,2012,52
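
If more search terms were added later, writing each rename by hand would get tedious. As an alternative sketch (not what this notebook does), the same names can be derived from the existing ones with pandas string methods. The `demo` frame is a hypothetical stand-in for `trends_df`.

```python
import pandas as pd

# Hypothetical empty frame with a few of the raw column names
demo = pd.DataFrame(columns=['date', 'dengue fever', 'bone pain', 'mosquito bite', 'rain'])

# strip surrounding whitespace, lowercase, and replace inner spaces with underscores
demo.columns = (demo.columns
                .str.strip()
                .str.lower()
                .str.replace(' ', '_', regex=False))

new_names = list(demo.columns)
```

The explicit `rename` dictionary used above has the advantage of making every change visible, which is reasonable for a handful of columns.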


In [6]:
# keep only the relevant columns and reorder them for easy reading
trends_df = trends_df[['year', 'week', 'dengue_fever', 'dengue', 'bone_pain',
                       'rain', 'mosquito_bite', 'fever', 'rashes', 'rash', 'mosquito']]
trends_df = trends_df.astype('int')
trends_df

Unnamed: 0,year,week,dengue_fever,dengue,bone_pain,rain,mosquito_bite,fever,rashes,rash,mosquito
0,2022,1,0,11,0,87,30,54,26,66,39
1,2022,2,21,12,58,34,19,59,86,67,40
2,2022,3,11,16,0,43,19,64,75,70,58
3,2022,4,0,16,19,32,45,66,55,80,46
4,2022,5,0,13,44,53,56,64,76,76,44
...,...,...,...,...,...,...,...,...,...,...,...
569,2012,49,63,60,45,36,0,89,42,71,84
570,2012,50,33,32,0,49,0,94,49,65,67
571,2012,51,60,59,87,47,0,86,90,76,50
572,2012,52,63,49,36,57,51,86,55,62,61
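
Before saving, it can be worth confirming that the cleaned frame looks the way we expect. The sketch below assumes that Google Trends interest values lie on a 0 to 100 scale, that week numbers fall between 1 and 53, and that each (year, week) pair appears once. The `validate_trends` helper and the `cleaned` sample are hypothetical, not part of the original notebook.

```python
import pandas as pd

def validate_trends(df, key_cols=('year', 'week')):
    """Raise ValueError if the cleaned frame violates basic expectations."""
    term_cols = [c for c in df.columns if c not in key_cols]
    if df.isna().any().any():
        raise ValueError('missing values present')
    terms = df[term_cols]
    if not ((terms >= 0) & (terms <= 100)).all().all():
        raise ValueError('search interest outside 0-100')
    if not df['week'].between(1, 53).all():
        raise ValueError('week number outside 1-53')
    if df.duplicated(subset=list(key_cols)).any():
        raise ValueError('duplicate (year, week) rows')
    return True

# Hypothetical sample using a few values from the output above
cleaned = pd.DataFrame({
    'year': [2022, 2022, 2012],
    'week': [1, 2, 52],
    'dengue_fever': [0, 21, 63],
    'dengue': [11, 12, 49],
})

is_valid = validate_trends(cleaned)
```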


### Save the cleaned dataframe as a csv

Save the result so that we can easily read it in during analysis later on.

In [7]:
trends_df.to_csv('../../data/cleaned/trends_clean.csv', index=False)
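
Passing `index=False` keeps the row index out of the file; otherwise `pd.read_csv` would bring it back as an extra `Unnamed: 0` column. A minimal round-trip sketch, using a temporary directory and a hypothetical `cleaned` sample rather than the real output path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the cleaned frame
cleaned = pd.DataFrame({
    'year': [2022, 2022],
    'week': [1, 2],
    'dengue': [11, 12],
    'rain': [87, 34],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'trends_clean.csv')
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True when values, column names, and dtypes all survive the round trip
round_trip_ok = cleaned.equals(reloaded)
```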