Check distribution of sentiment scores before the resampling step

In [1]:
import os
import numpy as np
import pandas as pd

**This code sorts the raw tweets with the computed sentiment scores into county-level files based on their FIPS codes.**

In [2]:
in_dir = "/Users/felix/Downloads/geolocated"
out_dir = "/Users/felix/Downloads/fips_sorted"

in_flist = os.listdir(in_dir)

# loop over the raw, geolocated tweet files
for fname in in_flist:
    print("\n--------------- Processing file {} ---------------".format(fname))

    # read file, auto-convert dtypes and set date time index
    df = pd.read_pickle(os.path.join(in_dir, fname))
    df = df.infer_objects()
    df = df.set_index("Date")

    # TODO: The geolocation script apparently did not locate all tweets.
    # We still have to deal with those somehow. Drop them for now to experiment further.
    # Filter with a boolean mask: dropping by index label (the date) would also remove
    # located tweets that share a timestamp with an unlocated one.
    df_cleaned = df.loc[df.FIPS != ""]
    print("Dropped {} tweets due to missing geolocation.\n".format(df.shape[0] - df_cleaned.shape[0]))

    # get unique fips codes within the raw file
    fips_codes = df_cleaned.FIPS.unique()

    # iterate over fips codes and store the corresponding tweets in separate, county-level data frames
    for fips_code in fips_codes:

        # fetch the subset of the tweet data frame corresponding to each unique FIPS code
        sub_df_county = df_cleaned.loc[df_cleaned.FIPS == fips_code]
        assert len(sub_df_county.FIPS.unique()) == 1

        # save or else open and append
        # use a separate name so the outer loop variable `fname` is not overwritten
        county_fname = "{}.pkl".format(fips_code)
        fpath = os.path.join(out_dir, county_fname)
        if os.path.exists(fpath):
            print("county file for fips code {} already exists.".format(fips_code))

            # open existing file and append tweets as new rows
            existing_df_county = pd.read_pickle(fpath)

            # this step creates duplicates if it is run twice for the same raw files,
            # so duplicates have to be dropped BEFORE resampling to daily
            # sentiment distributions.
            merged_df_county = pd.concat([existing_df_county, sub_df_county], axis=0)

            # drop potential duplicate rows based on unique tweet ID (ESSENTIAL STEP)
            merged_df_county = merged_df_county.drop_duplicates(subset="ID")        

            # re-sort tweets by date
            merged_df_county = merged_df_county.sort_values("Date", ascending=True)
            merged_df_county.to_pickle(fpath)

            # number of new unique tweets appended (rows after merging minus rows before)
            new_tweets = merged_df_county.shape[0] - existing_df_county.shape[0]
            print("appended {} new tweets to county with fips code {}\n".format(new_tweets, fips_code))

        else:
            print("creating new file for county with fips code {}.".format(fips_code))
            sub_df_county.to_pickle(fpath)


--------------- Processing file 192.p ---------------
Dropped 230 tweets due to missing geolocation.

county file for fips code 06001 already exists.
appended 0 new tweets to county with fips code 06001

county file for fips code 06013 already exists.
appended 0 new tweets to county with fips code 06013

county file for fips code 06075 already exists.
appended 0 new tweets to county with fips code 06075

county file for fips code 06095 already exists.
appended 0 new tweets to county with fips code 06095

county file for fips code 06019 already exists.
appended 0 new tweets to county with fips code 06019

county file for fips code 06041 already exists.
appended 0 new tweets to county with fips code 06041

county file for fips code 06055 already exists.
appended 0 new tweets to county with fips code 06055

county file for fips code 06077 already exists.
appended 0 new tweets to county with fips code 06077

county file for fips code 06081 already exists.
appended 0 new tweets to county w
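
The append-and-deduplicate step can be tried in isolation on toy data. The following is a minimal sketch, not the notebook's code: the helper name `merge_county_tweets` and the toy values are assumptions made for illustration, and it assumes the existing county frame contains no duplicate tweet IDs and is indexed by date.

```python
import pandas as pd


def merge_county_tweets(existing, incoming, id_col="ID"):
    """Append incoming tweets to an existing county frame (hypothetical helper).

    Returns the merged frame and the number of genuinely new tweets.
    Assumes `existing` has no duplicate IDs and both frames share a date index.
    """
    merged = pd.concat([existing, incoming], axis=0)
    # keep the first occurrence of every tweet ID, then restore date order
    merged = merged.drop_duplicates(subset=id_col).sort_index()
    n_new = merged.shape[0] - existing.shape[0]
    return merged, n_new


# toy data, made up for illustration
existing = pd.DataFrame(
    {"ID": [1, 2], "polarity": [0.1, -0.3]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)
existing.index.name = "Date"

incoming = pd.DataFrame(
    {"ID": [2, 3], "polarity": [-0.3, 0.5]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]),
)
incoming.index.name = "Date"

merged, n_new = merge_county_tweets(existing, incoming)

# merging the same batch a second time should add nothing
merged_again, n_again = merge_county_tweets(merged, incoming)
print(n_new, n_again)
```

Under these assumptions a repeated run over the same raw file adds zero rows, which matches the "appended 0 new tweets" lines in the output above.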

**This code calculates quantiles, mean and standard deviation while resampling to a daily frequency. It should be applied once we have created the ordered, county-level tweet files.**

In [3]:
in_dir = "/Users/felix/Downloads/fips_sorted"
out_dir = "/Users/felix/Downloads/processed_daily"

in_flist = os.listdir(in_dir)

# loop over the county-level tweet files
for fname in in_flist:
    print("\n--------------- Resampling file {} ---------------".format(fname))

    # read county file (the date time index was already set in the sorting step)
    df = pd.read_pickle(os.path.join(in_dir, fname))

    # copy
    df_out = df.copy()

    # iterate over the sentiment score columns
    dfs_sentiments = []
    sentiment_cols = ['polarity', 'subjectivity', 'positive', 'negative', 'neutral']

    for col in sentiment_cols:
        # quantiles: min, 2.5%, quartiles, 97.5%, max
        df_q = df_out[col].resample("D").quantile(q=[0., 0.025, 0.25, 0.5, 0.75, 0.975, 1.])
        df_q.index = df_q.index.set_names(["date", "quantile"])
        df_q = df_q.unstack()
        new_col = ["{}_{}".format(col, q) for q in df_q.columns]
        df_q = df_q.rename(columns=dict(zip(df_q.columns, new_col)))

        # mean and sd
        df_q["{}_mean".format(col)] = df_out[col].resample("D").mean()
        df_q["{}_sd".format(col)] = df_out[col].resample("D").std()

        # collect the daily statistics of this sentiment score
        dfs_sentiments.append(df_q)

    # concatenate
    df_merged_daily = pd.concat(dfs_sentiments, axis=1)

    # add the daily sum of retweets
    df_merged_daily["retweets_total"] = df_out["Retweets"].resample("D").sum()
    
    # store daily resampled data to pickle file
    df_merged_daily.to_pickle(os.path.join(out_dir, fname))


--------------- Resampling file 06019.pkl ---------------

--------------- Resampling file 05077.pkl ---------------

--------------- Resampling file 05117.pkl ---------------

--------------- Resampling file 05107.pkl ---------------

--------------- Resampling file 01003.pkl ---------------

--------------- Resampling file 01001.pkl ---------------

--------------- Resampling file 12033.pkl ---------------

--------------- Resampling file 05001.pkl ---------------

--------------- Resampling file 06041.pkl ---------------

--------------- Resampling file 06055.pkl ---------------

--------------- Resampling file 06095.pkl ---------------

--------------- Resampling file 06081.pkl ---------------

--------------- Resampling file 06075.pkl ---------------

--------------- Resampling file 01097.pkl ---------------

--------------- Resampling file 05147.pkl ---------------

--------------- Resampling file 06077.pkl ---------------

--------------- Resampling file 05095.pkl -------------
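
To make the resulting column layout concrete, here is a small sketch of the same resampling pattern on toy data. The tweet values and the reduced quantile list are assumptions chosen for illustration; only the `polarity` column name is taken from the cell above.

```python
import pandas as pd

# toy tweets for one county (values are made up for illustration)
tweets = pd.DataFrame(
    {"polarity": [0.0, 1.0, 0.5]},
    index=pd.to_datetime(
        ["2020-01-01 08:00", "2020-01-01 20:00", "2020-01-02 12:00"]
    ),
)
tweets.index.name = "Date"

col = "polarity"
quantiles = [0.25, 0.5, 0.75]  # shortened list, for readability only

# one row per (day, quantile), then pivot the quantile level into columns
df_q = tweets[col].resample("D").quantile(q=quantiles)
df_q.index = df_q.index.set_names(["date", "quantile"])
df_q = df_q.unstack()
df_q.columns = ["{}_{}".format(col, q) for q in df_q.columns]

# daily mean and standard deviation, aligned on the daily index
df_q["{}_mean".format(col)] = tweets[col].resample("D").mean()
df_q["{}_sd".format(col)] = tweets[col].resample("D").std()

print(df_q)
```

Each sentiment score contributes one column per quantile plus a mean and a standard deviation column. On a day with a single tweet all quantiles coincide and the standard deviation is undefined (NaN).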

A check for std = 0 could be used to drop all columns that show no variation over time. This is the case for several quantiles of the different sentiment scores.
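
Such a check could look roughly like the sketch below. The helper name `drop_constant_columns`, the `tol` parameter and the toy frame are assumptions for illustration; the column names only mimic the `{score}_{quantile}` naming used in the resampling cell.

```python
import pandas as pd


def drop_constant_columns(df, tol=0.0):
    """Drop numeric columns whose standard deviation over time is <= tol.

    Hypothetical helper. Columns with an undefined (NaN) standard deviation
    are kept, because the comparison with `tol` is False for NaN.
    """
    stds = df.std(numeric_only=True)
    constant_cols = stds.index[stds <= tol]
    return df.drop(columns=constant_cols)


# toy daily frame, values made up for illustration
daily = pd.DataFrame(
    {
        "polarity_0.0": [-1.0, -1.0, -1.0],  # constant minimum
        "polarity_0.5": [0.1, 0.0, 0.2],     # varies over time
        "polarity_1.0": [1.0, 1.0, 1.0],     # constant maximum
    },
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

reduced = drop_constant_columns(daily)
print(list(reduced.columns))
```

A small positive `tol` may be preferable to an exact comparison with zero, since floating-point rounding can leave a tiny non-zero standard deviation for columns that are constant in practice.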