# Election Tweets: Geolocation

To understand just how real the "red state/blue state" divide is at the tweet level, I need some geolocation data. Since very few users (<2%) activate the geotagging feature on their accounts, I'll need to get this information from other sources. I decided to extract it from people's self-reported location (in their user profile), but first I want to see what the data looks like and what information I can extract to use for modelling.

In [153]:
import pandas as pd
import random

import warnings
warnings.filterwarnings('ignore')

In [41]:
tweets = pd.read_csv('/Users/laraehrenhofer/Documents/Coding_Projects/git_repos/tweet-the-people-legacy/data/tweet_pg.csv')

## Step 1: Understanding geolocation by numbers

Some introductory questions:
- How many tweets have geotagged location vs. location based on profile info?
- Any differences by Republican/Democrat ticket?

### Geotagged vs. profile geolocation

Some basic cleanup is necessary as some of these locations were generated in development versions of the tweet streaming script.

In [134]:
loc_no = pd.DataFrame(tweets.groupby(['loc_type', 'ticket']).count()['tweet_id']).unstack()

In [135]:
keep_rows = ['bound_box_coords', 'no_loc', 'user_loc']
drop_rows = [column for column in loc_no.index.values if column not in keep_rows]

In [136]:
loc_no = loc_no.drop(drop_rows, axis=0)
loc_no = pd.DataFrame(loc_no.to_records())
loc_no.columns = ['loc_type', 'Democrat', 'Republican']
loc_no['total'] = loc_no['Democrat'] + loc_no['Republican']

In [137]:
def get_ticket_percent(data, ticket):
    """Add two percentage columns for `ticket`: share of all tweets and share of that ticket's tweets."""
    colname_total = f'percent_of_total_{ticket}'
    colname_ticket = f'percent_of_ticket_{ticket}'
    # compute each denominator once instead of re-summing inside a per-row lambda
    data[colname_total] = (data[ticket] / data['total'].sum() * 100).round(2)
    data[colname_ticket] = (data[ticket] / data[ticket].sum() * 100).round(2)
    return data

In [138]:
tickets = ['Democrat', 'Republican']

for ticket in tickets:
    loc_no = get_ticket_percent(loc_no, ticket)

In [142]:
loc_no['total_percent'] = loc_no['total'].apply(lambda x: round((x/sum(loc_no['total']))*100, 2))

In [143]:
loc_no

| | loc_type | Democrat | Republican | total | percent_of_total_Democrat | percent_of_ticket_Democrat | percent_of_total_Republican | percent_of_ticket_Republican | total_percent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bound_box_coords | 851.0 | 401.0 | 1252.0 | 0.27 | 0.45 | 0.13 | 0.31 | 0.4 |
| 1 | no_loc | 78283.0 | 50747.0 | 129030.0 | 24.75 | 41.53 | 16.04 | 39.7 | 40.79 |
| 2 | user_loc | 109385.0 | 76665.0 | 186050.0 | 34.58 | 58.02 | 24.24 | 59.98 | 58.81 |


#### Conclusions

1. Less than half a percent of this tweet database has 'ground-truth' geotagged location data.
2. 40% has no location at all.
3. But between geotagging and user profile location data, we still have close to 60% of the dataset available.
4. Initially it looks like there's a partisan divide in people's provision of geolocation data (looking at the `percent_of_total` columns). However, breaking it down into percent by ticket suggests that there isn't an enormous contrast between the two populations after all: the apparent gap is an artefact of there being more Democrat than Republican tweets overall (roughly 60% vs. 40% of the rows counted in the table). (A logistic regression could quantify the probability of having a location based on ticket, but it doesn't seem interesting enough to warrant the extra attention.)
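
To make point 4 concrete, here is a minimal sketch that re-enters the counts from the table by hand and computes, for each ticket, the share of tweets with any location at all (geotagged or profile-based). The `counts` frame and the variable names are illustrative and not part of the original pipeline.

```python
import pandas as pd

# Counts copied by hand from the loc_no output (illustrative re-entry).
counts = pd.DataFrame(
    {'Democrat': [851, 78283, 109385], 'Republican': [401, 50747, 76665]},
    index=['bound_box_coords', 'no_loc', 'user_loc'],
)

# Tweets with some location = everything except the no_loc row.
has_location = counts.drop(index='no_loc').sum()

# Share of each ticket's tweets that carry a location.
share_with_location = has_location / counts.sum()
print((share_with_location * 100).round(2))
```

The two shares differ by only a couple of percentage points, which is the by-ticket view that the `percent_of_ticket` columns give row by row.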

## Step 2: Quality check

I used the `geocoder` package to get over 42k locations based on people's user profiles. While some folks' self-reported location is plausible ("Milwaukee-Chicago"), other locations are less straightforward to map onto real-world locations ("Marvel Universe", "hell since 2016", "God's Country", "Always butter the Pan"). What did `geocoder` make of these less plausible locations?

How to check: Sample 1000 locations and manually check. (I want a GUI for this and am doing it in good old Excel.)
- What proportion of these is a joke location? This will help us get a sense of the amount of noise in the data.
- What does `geocoder` make of the joke locations?

In [164]:
list_locs = list(tweets['location'].unique())
len(list_locs)

45219

In [165]:
# separate out just the tweets that have a profile-based location

# filter on loc_type (the column that holds 'user_loc'), not on the free-text location column
loc_tweets = tweets[tweets['loc_type'] == 'user_loc']

In [166]:
loc_sample = loc_tweets.sample(n=1000)

In [169]:
keep_cols = ['location', 'loc_type', 'us_state', 'loc_lat', 'loc_lon']
loc_sample = loc_sample[[column for column in loc_sample.columns if column in keep_cols]]
loc_sample.to_csv('./location_random_sample.csv', index=False)

**Manual data annotation of random sample:**

1. Binary manual classification into feature `unclear_loc` (0, 1). Some examples of `unclear_loc == 1` in this particular sample:
    - a `location: Right now? Arkansas`
    - b `Earth`
    - c `In Trump's Nightmares`
    - d `COviNGTON va BAbY!!!!`
    - e `Nicht Bielefeld`
    - f `None of your business`
    - g `NoVa`
    - h `The Dirty South-GA`
2. Binary manual classification into feature `implausible_assigned_loc` (0, 1). My highly subjective criteria were:
    - `implausible_assigned_loc == 0`: if `geocoder` managed to get correct information out of an unclear location (e.g. assigning location `Arkansas` to example a, `Virginia` to example d), or classifies it as `other` (e.g. assigning `other` to example e).
    - `implausible_assigned_loc == 1`: odder assignments get a rating of 1, e.g. assigning location b to Texas, c to Maryland, f to Washington, or g to Ohio ("NoVa" is short for Northern Virginia), or failing to get location information where technically present (e.g. assigning `other` to example h).
    
**Next up:** let's get a sense of the scale of the noise.
1. What percentage of locations are unclear?
2. Of these, what percentage are implausible and therefore likely erroneous?
3. What's the likely percentage of erroneous assignments overall?

In [196]:
loc_sample_annotated = pd.read_csv('./location_random_sample_13012021.csv')

In [197]:
loc_sample_annotated.dtypes

loc_lat                     float64
loc_lon                     float64
loc_type                     object
location                     object
us_state                     object
unclear_loc                   int64
implausible_assigned_loc      int64
dtype: object

In [198]:
percent_unclear = (sum(loc_sample_annotated['unclear_loc'])/len(loc_sample_annotated))*100
percent_unclear

18.410462776659962

In [199]:
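# denominator: unclear rows only; this assumes implausible_assigned_loc is only ever 1 on rows where unclear_loc is also 1
percent_implausible = (sum(loc_sample_annotated['implausible_assigned_loc'])/len(loc_sample_annotated[loc_sample_annotated['unclear_loc'] == 1]))*100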
percent_implausible

44.80874316939891

In [200]:
percent_erroneous_total = (sum(loc_sample_annotated['implausible_assigned_loc'])/len(loc_sample_annotated))*100
percent_erroneous_total

8.249496981891348

#### Conclusions from manual data quality check

In this sample, nearly one-fifth of locations were unclear; just under half of those were not assigned a correct location by `geocoder`. Overall, this results in a ca. 8% rate of poor location information.

This is pretty rough news, as it means there's quite a lot of noise in the geolocation data!
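
Since the 8% figure comes from a sample of under 1,000 rows, it helps to get a feel for its uncertainty. The sketch below computes a Wilson score interval for the error rate. The counts (82 erroneous out of 994 annotated rows) are reconstructed from the printed percentages rather than read from the annotated file, and `wilson_interval` is a helper written for this illustration.

```python
import math

# Counts inferred from the printed percentages (82 / 994 = 8.2495%);
# a reconstruction, not values read from the annotated CSV.
n_sample = 994
n_erroneous = 82


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


lower, upper = wilson_interval(n_erroneous, n_sample)
print(f"{lower:.1%} to {upper:.1%}")
```

Under these assumptions the interval runs from roughly 7% to 10%, so the noise estimate is reasonably stable for a sample of this size.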

**How to get around this?**

Options from various domains (which could also help impute missing location data for the users who provided none at all):

1. **Hashtag tracking:** If the tweet contains a hashtag, in which state is this hashtag most popular?
    - Conditional probabilities/TF-IDF: given a particular hashtag, what's the likelihood of it being tweeted from a specific state?
    - Use a Bayesian model to predict state from hashtag?
        - This is risky as the model will be learning from poorly labelled (noisy) data.
        - Are there enough geotagged tweets containing hashtags in order to build a Bayesian model restricted to just this data subset, and then extrapolate from there to the wider dataset?
        - Benchmark model performance on the ground-truthed, human-annotated reference sample; if performance is reasonable, apply to the rest of the dataset
    
2. **Social network clustering:** Can we interpolate a user's location from the people they follow and their locations?
    - The dataset contains tweets from >220k unique users
    - Steps would be:
        - Grab the accounts a user follows (the Twitter API returns these in batches of up to 5k accounts per request, which seems like more than enough)
        - Get those users' locations where possible (this will have the same problem with noise in self-reported locations)
        - Interpolation: copy location of the account that a user most frequently interacts with? Does Twitter provide this data in some summary form? If so, is the location of a high-interaction account actually a good proxy for a user's location? (Boils down to: do people interact more with people who live close to them? Probably some do and some don't.)
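
One simple stand-in for the interpolation step, different from copying the single highest-interaction account, is a majority vote over the states of followed accounts. The sketch below assumes those states have already been fetched and geocoded; `infer_state_from_followed` and the `min_share` threshold are hypothetical, and no Twitter API calls are shown.

```python
from collections import Counter


def infer_state_from_followed(followed_states, min_share=0.5):
    """Return the most common state among followed accounts, or None without a clear majority."""
    # Drop followed accounts whose state could not be determined.
    known = [state for state in followed_states if state is not None]
    if not known:
        return None
    state, count = Counter(known).most_common(1)[0]
    # Only accept the vote if the winner covers at least min_share of known states.
    return state if count / len(known) >= min_share else None


example = ['VA', 'VA', None, 'MD', 'VA']
print(infer_state_from_followed(example))
```

The threshold is there so that users who follow accounts spread across many states get no assignment rather than a weak guess.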
        
        
I'm going to try hashtag tracking first.
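
As a rough illustration of the conditional-probability idea, here is a minimal sketch on made-up data: it estimates P(state | hashtag) as row-normalised counts and picks the most likely state per hashtag. The `toy` frame, its hashtags, and the `hashtag` column name are invented for the example; only `us_state` matches a column in the real data.

```python
import pandas as pd

# Made-up (hashtag, state) pairs, one row per hashtag occurrence.
toy = pd.DataFrame({
    'hashtag': ['#vote', '#vote', '#vote', '#maga', '#maga', '#maga', '#bidenharris'],
    'us_state': ['TX', 'TX', 'CA', 'TX', 'TX', 'OH', 'CA'],
})

# Each row sums to 1: the estimated distribution over states given the hashtag.
p_state_given_hashtag = pd.crosstab(toy['hashtag'], toy['us_state'], normalize='index')

# Most likely state per hashtag (ties would resolve to the first column).
most_likely_state = p_state_given_hashtag.idxmax(axis=1)
print(p_state_given_hashtag.round(2))
print(most_likely_state)
```

On real data the same table would be built from tweets with trusted locations, and rare hashtags would need some smoothing or a minimum-count cut-off before their estimates mean much.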

## Step 3: Location Interpolation -- Conditional probabilities