# Data Inspection

This notebook explores the data and identifies data quality issues such as missing values and duplicate rows.

In [1]:
import pandas as pd
import numpy as np
import os


# import helper_functions.py
import sys
# insert at 1, 0 is the script path (or '' in REPL)
sys.path.insert(1, '../src/data')
import helper_functions as h

# load metadata and googletrends search interest
metadata_files = h.get_files('../data/raw', name_contains='*metadata*')
gtrends_files = h.get_files('../data/raw', name_contains='*gtrends*')

gtrends_column_names = ['date', 'keyword', 'search_interest']

list_metadata_df = [pd.read_csv(i) for i in metadata_files]
list_gtrends_df = [pd.read_csv(i, header=0, names=gtrends_column_names) for i in gtrends_files]

# consider only first query
df_metadata = list_metadata_df[0]
df = list_gtrends_df[0]

## Core specifications

### Data types

The first Google Trends query, stored in `20201017-191627gtrends.csv`, has 998324 entries and the columns *date*, *keyword* and *search_interest*. They store data as string, string and float64 respectively. Because *date* holds date values, it should be converted to a datetime type.

The corresponding metadata `20201017-191627gtrends_metadata.csv` contains 3835 entries and 11 columns: 'topic', 'positive', 'date_define_topic', 'ticker', 'firm_name_raw', 'sector', 'firm_name_processed', 'date_get_firmname', 'keyword', 'date_construct_keyword', 'date_query_googletrends'. Except for *positive*, a dummy variable indicating positive or negative ESG criteria, all columns store strings. Columns that store dates as strings should likewise be converted to a datetime type.
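The conversion could look like the sketch below. It runs on a small hypothetical frame that reuses some of the metadata column names. The sample values and the date formats are assumptions, not taken from the actual file.

```python
import pandas as pd

# Hypothetical sample; values and date formats are assumed for illustration
meta = pd.DataFrame({
    "topic": ["emissions"],
    "positive": [0],
    "date_define_topic": ["2020-10-17"],
    "date_query_googletrends": ["2020-10-17 19:16:27"],
})

# pick every column whose name starts with "date"
date_cols = [c for c in meta.columns if c.startswith("date")]

# errors="coerce" turns unparseable entries into NaT instead of raising
meta[date_cols] = meta[date_cols].apply(pd.to_datetime, errors="coerce")

print(meta.dtypes)
```

With `errors="coerce"`, entries that cannot be parsed show up as missing values afterwards instead of stopping the conversion.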




In [None]:
print("Filenames\n", '-'*40)
# os.path.basename extracts the filename independent of the path separator
print(os.path.basename(gtrends_files[0]))
print(os.path.basename(metadata_files[0]))

print('\nDataframe info\n', '-'*40)
# info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()
df_metadata.info()

### Descriptives

The descriptives for the main `gtrends.csv` data reveal an issue with the *date* column: it contains 0 for some keyword queries. This arises by construction when the Google Trends query returns no measurable search interest for five consecutive keywords. For these cases, the date column needs to be imputed with the dates covered by the query. Future versions of the data collection script could store the queried date range in the corresponding `metadata.csv` file to make imputation easier.

Apart from this, there seem to be no anomalies. *Keyword* contains 261 unique values, as expected, since Google Trends returns weekly search interest over the last five years plus an entry for the current week: $5 \text{ years} \times 52 \text{ weeks} + 1 \text{ (current week)} = 261$.

*Note:* I intentionally omit descriptives of the `metadata.csv` file. It contains all data about the Google Trends query when it was initialized. 

In [None]:
round(df.describe(include='all', percentiles=[]))

## Missings

The main Google Trends data does not contain missing values. Nevertheless, *date* values of 0.0 are placeholders and need to be replaced for convenience.
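Because the placeholders are not `NaN`, the missing-value checks in this section do not count them. A minimal sketch of flagging them as missing follows. It assumes the placeholder is read from the CSV as the string `"0"` or `"0.0"`, and the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample; the string form of the placeholder is an assumption
sample = pd.DataFrame({
    "date": ["2020-10-04", "0.0", "0.0"],
    "keyword": ["a", "b", "b"],
    "search_interest": [12.0, 0.0, 0.0],
})

PLACEHOLDERS = ["0", "0.0"]  # assumed representations of the placeholder

# True where the date is a placeholder rather than a real date
is_placeholder = sample["date"].astype(str).isin(PLACEHOLDERS)

# mask() replaces flagged entries with NaN, so isna() picks them up
masked = sample.assign(date=sample["date"].mask(is_placeholder))

print(masked["date"].isna().sum())
```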

In [None]:
# check rows
rows_all = df.shape[0]
rows_nomiss = df.dropna().shape[0]

rowmiss_count = rows_all - rows_nomiss
rowmiss_share = rowmiss_count/rows_all*100

print("Missings per row: {}/{} ({} %)".format(rowmiss_count,rows_all, rowmiss_share))

# check columns
col_miss = [col for col in df.columns if df[col].isna().any()]

if not col_miss:
    print("No missings for any column.")
    
else:
    print(df.loc[:,col_miss].isna().sum())
    

## Duplicates

The data contains 687700 rows that are duplicated with respect to the *date* and *keyword* columns. Google Trends returns `None` when the search interest for all five keywords is unmeasurably small. In that case, the `query()` function replaces the values for *date* and *search_interest* with 0.0 (float64). A successful query, in contrast, returns 261 rows per keyword, which is one more than `query()` writes for an unsuccessful one. To be consistent, especially before imputing the date, this has to be fixed.

Suggested improvements to the data sourcing in `query_google_trends.py` and its `handle_query_results()` method are listed below.

In [None]:
df[df.duplicated()].value_counts() #.describe(include='all', percentiles=[])
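To separate the two symptoms described above, duplicates can be counted on the *date* and *keyword* columns only, and the number of rows can be compared across keywords. The sketch below uses a small hypothetical frame in which keyword `b` stands for an unsuccessful query.

```python
import pandas as pd

# Hypothetical sample: keyword "b" mimics a query without search interest
sample = pd.DataFrame({
    "date": ["2020-10-04", "2020-10-11", "0", "0", "0"],
    "keyword": ["a", "a", "b", "b", "b"],
    "search_interest": [12.0, 15.0, 0.0, 0.0, 0.0],
})

# rows that repeat an earlier (date, keyword) combination
dup_mask = sample.duplicated(subset=["date", "keyword"], keep="first")
n_duplicates = int(dup_mask.sum())

# rows per keyword; in the real data this would show 260 vs. 261
rows_per_keyword = sample.groupby("keyword").size()

print(n_duplicates)
print(rows_per_keyword)
```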

## Improvements for data collection and the `query()` method

Future implementations of the `handle_query_results()` method should consider the following:

For queries that return `None` due to no search interest:
* return 261 rows for unsuccessful queries instead of 260
* impute date with the 261 date entries from successful queries
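The two points above could be sketched as follows. `placeholder_rows` is a hypothetical helper and not part of `query_google_trends.py`. The reference dates would come from any successful query. Four weekly dates are used here instead of 261 to keep the example short.

```python
import pandas as pd

def placeholder_rows(keyword, reference_dates):
    """Build zero-interest rows for a keyword, one per reference date.

    Hypothetical helper: reference_dates are the dates of a successful query.
    """
    return pd.DataFrame({
        "date": list(reference_dates),
        "keyword": keyword,          # scalar is broadcast to every row
        "search_interest": 0.0,      # no measurable search interest
    })

# Assumed weekly dates; in practice, the 261 dates of a successful query
reference_dates = pd.date_range("2020-01-05", periods=4, freq="W-SUN")
filled = placeholder_rows("keyword without interest", reference_dates)

print(filled)
```

This way, unsuccessful queries get the same number of rows and the same dates as successful ones, so no later imputation is needed.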


*Note:* The improvements are only listed here, to be concise and to avoid re-launching the query process, which takes time due to timeouts.

# Data inspection: Shortcut functions to inspect data

In [244]:
def inspect_core_specifications(data, descriptives=False):
    """Inspect data types, shape and descriptives
    
    :param data: pandas dataframe or list of pandas dataframes
    :param descriptives: boolean, print descriptive statistics (default=False)
    :return: None
    """    
    # check if data is list of dataframes
    if isinstance(data, list):
        for d in data:
            print('-'*40)
            print(d.info())
            
            if descriptives:
                print('-'*40)
                print(round(d.describe(include='all', percentiles=[])))
            
    else:
        print('-'*40)
        print(data.info())
        
        if descriptives:
            print('-'*40)
            print(round(data.describe(include='all', percentiles=[])))
    print('*'*40)
        

def inspect_missings(data):
    """Inspect missings across rows and across columns
    
    :param data: pandas dataframe 
    :return: List with column names that contain missings 
    """
    print("MISSINGS")
    print('-'*40)
    # check rows
    rows_all = data.shape[0]
    rows_nomiss = data.dropna().shape[0]

    rowmiss_count = rows_all - rows_nomiss
    rowmiss_share = rowmiss_count/rows_all*100

    print("Missings per row: {}/{} ({} %)".format(rowmiss_count,rows_all, rowmiss_share))
    print()
    
    # check columns
    col_miss = [col for col in data.columns if data[col].isna().any()]
    # no missings for any column
    if not col_miss:
        print("No missings for any column.")
    else:
        # print share of missings for each column
        print("Column missings")
        ds_colmiss = data.loc[:,col_miss].isna().sum()
        ds_colmiss_relative = data.loc[:,col_miss].isna().sum()/rows_all*100
        
        print(pd.concat([ds_colmiss, ds_colmiss_relative], axis=1, keys=['Count', 'Share (%)']))
            
    print('*'*40)
    
    # col_miss is an empty list when no column contains missings
    return col_miss

def inspect_duplicates(data):
    """Gives an overview of duplicate rows 
    
    :param data: DataFrame
    :return: None
    """
    
    pass
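`inspect_duplicates()` is still a stub. One way it could be filled in is sketched below, following the print layout of the other helpers. The `subset` parameter and the returned count are assumptions and not part of the original design.

```python
import pandas as pd

def inspect_duplicates(data, subset=None):
    """Gives an overview of duplicate rows (sketch)

    :param data: DataFrame
    :param subset: columns to compare, all columns if None (assumed parameter)
    :return: number of duplicate rows (assumption, the stub returns None)
    """
    print("DUPLICATES")
    print('-'*40)
    rows_all = data.shape[0]
    # duplicated() flags every repetition after the first occurrence
    dup_count = int(data.duplicated(subset=subset).sum())
    # guard against division by zero for empty dataframes
    dup_share = dup_count/rows_all*100 if rows_all else 0.0

    print("Duplicate rows: {}/{} ({:.1f} %)".format(dup_count, rows_all, dup_share))
    print('*'*40)

    return dup_count
```

For the data at hand, it might be called as `inspect_duplicates(df, subset=['date', 'keyword'])`.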

In [197]:
inspect_core_specifications(df)
inspect_missings(df)

----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998324 entries, 0 to 998323
Data columns (total 3 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   date             998324 non-null  object 
 1   keyword          998324 non-null  object 
 2   search_interest  998324 non-null  float64
dtypes: float64(1), object(2)
memory usage: 22.8+ MB
None
****************************************
MISSINGS
----------------------------------------
Missings per row: 0/998324 (0.0 %)

No missings for any column.
****************************************


In [245]:
test_df = pd.DataFrame([np.zeros(10), np.zeros(10), np.repeat(np.nan, 10)]).T
inspect_missings(test_df)

MISSINGS
----------------------------------------
Missings per row: 10/10 (100.0 %)

Column missings
   Count  Share (%)
2     10      100.0
****************************************


[2]