# Project 5 - Leveraging Social Media to Map Natural Disasters
## Data Cleaning

## Table of Contents

1. [Imports](#Imports)
2. [Read in Dataframes](#Read-in-Dataframes)
3. [Data Cleaning](#Data-Cleaning)
    1. [Creating Target Variable](#Creating-Target-Variable)
    2. [Mapping Locations to the Tweets](#Mapping-Locations-to-the-Tweets)
4. [Export Clean Data](#Export-Clean-Data)

### Imports

In [1]:
# imports
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup             
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
import matplotlib.pyplot as plt
import seaborn as sns

# Allows us to see whole cells (untruncated); None replaces the deprecated -1 in pandas >= 1.0
pd.set_option('display.max_colwidth', None)

# setting a global random variable seed
np.random.seed(42)

%matplotlib inline

## Read in Dataframes

In [2]:
# Load hurricane data
hurricane = pd.read_csv('../datasets/hurricane_harvey.csv')
# Load floods data
floods = pd.read_csv('../datasets/floods.csv')
# Load mudslides data
mudslides = pd.read_csv('../datasets/mudslides.csv')
# Load noreaster data
noreaster = pd.read_csv('../datasets/noreaster.csv')
# Load tornado data
tornados = pd.read_csv('../datasets/tornados.csv')

### Clean Dataframes

In [3]:
# drop unnecessary columns from hurricane dataframe
hurricane.drop(['Unnamed: 0', 'fullname', 'retweet_id', 'retweeter_userid', 'retweeter_username', 
    'timestamp', 'timestamp_epochs', 'tweet_id', 'user_id', 'username', 'dataframe'], axis = 1, inplace=True)

# drop unnecessary columns from floods dataframe
floods.drop(['Unnamed: 0', 'fullname', 'retweet_id', 'retweeter_userid', 'retweeter_username', 
    'timestamp', 'timestamp_epochs', 'tweet_id', 'user_id', 'username', 'html'], axis = 1, inplace=True)

# drop unnecessary columns from mudslides dataframe
mudslides.drop(['Unnamed: 0', 'fullname', 'retweet_id', 'retweeter_userid', 'retweeter_username', 
    'timestamp', 'timestamp_epochs', 'tweet_id', 'user_id', 'username'], axis = 1, inplace=True)

# drop unnecessary columns from noreaster dataframe
noreaster.drop(['Unnamed: 0', 'fullname', 'retweet_id', 'retweeter_userid', 'retweeter_username', 
    'timestamp', 'timestamp_epochs', 'tweet_id', 'user_id', 'username', 'html'], axis = 1, inplace=True)

# drop unnecessary columns from tornados dataframe
tornados.drop(['Unnamed: 0', 'fullname', 'retweet_id', 'retweeter_userid', 'retweeter_username', 
    'timestamp', 'timestamp_epochs', 'tweet_id', 'user_id', 'username', 'html'], axis = 1, inplace=True)

In [4]:
# make a new column in each dataframe with the type of natural disaster it is
hurricane['type'] = 'hurricane'
floods['type'] = 'flood'
mudslides['type'] = 'mudslide'
noreaster['type'] = 'noreaster'
tornados['type'] = 'tornado'

In [5]:
# Combine dataframes
df = pd.concat([hurricane, floods, mudslides, noreaster, tornados], axis = 0)

In [6]:
# Reset index of new dataframe
df.reset_index(drop = True, inplace = True)

## Data Cleaning
#### Drop Duplicates

In [7]:
# Check shape of dataframe
df.shape

(24061, 7)

In [8]:
# Dropping duplicates
df.drop_duplicates(subset='text', inplace=True)

In [9]:
# Re-check shape of dataframe
df.shape

(22901, 7)

#### Check for Nulls

In [10]:
df.isnull().sum()

is_retweet    1
likes         1
replies       1
retweets      1
text          1
tweet_url     5
type          0
dtype: int64

In [11]:
# Drop nulls
df.dropna(inplace=True)

In [12]:
# Re-check null values
df.isnull().sum()

is_retweet    0
likes         0
replies       0
retweets      0
text          0
tweet_url     0
type          0
dtype: int64

#### Removing Unnecessary Punctuation

In [13]:
# Make sure all tweets in the text column are strings
df['text'] = df['text'].astype(str)

In [14]:
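# Use regex to strip links and unnecessary punctuation
# remove web links, then lowercase the text
df["text"] = df["text"].apply(lambda x: re.sub(r"http\S+", "", x).lower())
# remove picture links (the dot is escaped so it matches a literal ".")
df["text"] = df["text"].apply(lambda x: re.sub(r"pic\.twitter\S+", "", x))
# keep only spaces, letters, !, # and the digits 9 and 1 (so that "911" survives)
df["text"] = df["text"].apply(lambda x: re.sub(r"[^ a-zA-Z!#19]", "", x))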

We removed links to other websites and picture links because we wanted to train our model on the tweet text alone. We also removed all punctuation except `!` and `#`, and all digits except 9 and 1, so that "911" is preserved. Exclamation marks and 911 can convey urgency, and hashtags are widely used to draw attention to a tweet. Because a character class matches single characters, the pattern keeps every 9 and 1, not only those that form "911".
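
As a rough illustration of the cleaning steps, the sketch below applies the same three substitutions to a single made-up tweet. The helper name `clean_tweet` and the sample text are assumptions for illustration only, and the standard-library `re` module stands in for `regex`.

```python
import re

def clean_tweet(text):
    """Apply the three substitutions to one string (illustrative helper)."""
    # drop web links, then lowercase
    text = re.sub(r"http\S+", "", text).lower()
    # drop picture links
    text = re.sub(r"pic\.twitter\S+", "", text)
    # keep letters, spaces, !, # and the digits 9 and 1
    return re.sub(r"[^ a-zA-Z!#911]", "", text)

# made-up tweet, not taken from the dataset
sample = "Call 911 NOW! Roads flooded: http://t.co/abc #Harvey 2017"
print(clean_tweet(sample))
```

In the cleaned string the link and the colon are gone, "911" is intact, and the year 2017 is reduced to a stray "1".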

### Creating Target Variable

In [15]:
# note: 'life-threatening' cannot match because hyphens were stripped during cleaning
critical_words = ['fatality', 'destruction', 'rescue', 'stranded', 'stuck', 'injured', 'lost', 'dying', 'danger',
              'medivac', 'sos', 'save me', 'save us', 'debris', 'injury', 'drowning', 'ambulance', 'doctor', 
              'help us', 'help me', 'fire', 'life-threatening', 'starving', 'broke', 'please help', 
              '911', 'casualty', 'death', 'need help'
]

To create our target variable, we must determine which tweets may constitute calls for emergency help. To categorize tweets, we compiled a list of words and phrases that we judged likely to appear when an individual or group needs immediate help. A tweet is flagged if it contains any of them as a substring.
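
The keyword rule can be sketched on a few made-up tweets. The shortened keyword list and the sample texts below are assumptions for illustration; the check itself is a plain substring test.

```python
import pandas as pd

# small, made-up subset of the keyword list and sample tweets (assumptions)
keywords = ["rescue", "broke", "need help"]
tweets = pd.Series([
    "coast guard rescues family from roof",
    "my heart is broken for texas",
    "stay safe everyone #hurricaneharvey",
])

# 1 if any keyword occurs anywhere in the tweet (substring match), else 0
labels = tweets.apply(lambda t: int(any(k in t for k in keywords)))
print(labels.tolist())
```

Since the match is on substrings, "broke" also flags "broken", so some flagged tweets will not be genuine calls for help.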

In [16]:
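# collect each tweet that contains at least one critical word
# (a set stores every tweet once and makes the membership test below fast)
emergency_tweets = set()
for tweet in df['text']:
    if any(word in tweet for word in critical_words):
        emergency_tweets.add(tweet)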

In [17]:
df["target"] = df["text"].apply(lambda x: 1 if x in emergency_tweets else 0)

In [18]:
df['target'].value_counts()

0    20694
1    2202 
Name: target, dtype: int64

Clearly the classes are unbalanced, but that is to be expected because the majority of tweets are not calls for emergency help.
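
To put a number on the imbalance, a minimal sketch can turn the two counts from the output above into class shares. The Series below is rebuilt by hand from those counts rather than taken from `df`.

```python
import pandas as pd

# counts copied from the value_counts() output above
counts = pd.Series({0: 20694, 1: 2202}, name="target")

# share of each class in the labelled data
shares = counts / counts.sum()
print(shares.round(3))
```

On the real dataframe, `df['target'].value_counts(normalize=True)` should give the same shares directly.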

In [19]:
df.reset_index(inplace=True)

### Mapping Locations to the Tweets

We have assigned our target values, and we will now assign coordinates to each tweet. In an ideal scenario, this data could be accessed directly from the tweets themselves, but because of our limited access to tweet information, these assumed coordinates will stand in as a proof of concept.

Our targeted area of disaster occurrence is Houston, Texas. Although our dataset was built from multiple disasters in multiple cities around the US, we place every tweet in one area to simplify the visualization. The boundaries of the Houston area that we picked are:

top left: 30.105670, -96.243067

top right: 30.105670, -94.798950

bottom left: 29.333485, -96.243067

bottom right: 29.333485, -94.798950

From these points, we get the latitude and longitude ranges to map tweets to, and we sample each coordinate from a uniform distribution over its range. As an additional layer of interest, the target tweets are split into six groups. Five are placed in specific "disaster zones" that mark hotspots of critical areas, and the sixth is spread across the full area in the same fashion as the null targets. The null targets, tweets that do not indicate a need for assistance, are distributed across the entire area.
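
The sampling idea can be sketched in vectorized form as follows. The helper `sample_coords`, the `demo_box` values, and the use of `np.random.default_rng` are assumptions for illustration; the notebook itself draws one value per row with `np.random.uniform`.

```python
import numpy as np

def sample_coords(bounds, n, rng):
    """Draw n (lat, long) pairs uniformly inside a bounding box (illustrative helper)."""
    lat = rng.uniform(bounds["lat"][0], bounds["lat"][1], size=n)
    long = rng.uniform(bounds["long"][0], bounds["long"][1], size=n)
    # six decimal places, matching the precision used in this notebook
    return np.round(lat, 6), np.round(long, 6)

# hypothetical box for demonstration only, not one of the zones used here
demo_box = {"lat": [29.0, 30.0], "long": [-96.0, -95.0]}
rng = np.random.default_rng(42)
lats, longs = sample_coords(demo_box, 5, rng)
print(list(zip(lats, longs)))
```

Drawing all values in one call avoids a Python-level loop over rows, which matters little at this size but keeps the intent easy to read.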

In [20]:
# creating Houston boundary coordinates
total_area = {
    "lat": [29.333485, 30.105670],
    "long": [-96.243067, -94.798950]
}

In [21]:
# establishing each disaster zone
zone_1 = {
    "lat": [29.777257, 30.020122],
    "long": [-96.131424, -96.079262]
}
zone_2 = {
    "lat": [29.752092, 29.784874],
    "long": [-95.579720, -95.414298]
}
zone_3 = {
    "lat": [29.993420, 30.031473],
    "long": [-95.534265, -95.303653]
}
zone_4 = {
    "lat": [29.859171, 29.884774],
    "long": [-95.334881, -95.283405]
}
zone_5= {
    "lat": [29.504332, 29.524648],
    "long": [-95.128942, -95.017089]
}

In [22]:
# creating the new coordinate columns
df["latitude"] = np.nan
df["longitude"] = np.nan

The targets are going to be separated into different dataframes, have the locations attached accordingly, and then be concatenated back together.
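
A minimal sketch of this split, assign, and recombine pattern on toy data is shown below. The frame `demo` and the placeholder latitude values are made up; the point is that taking `.copy()` of each subset gives it its own data, so assigning to a column does not raise pandas' `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# toy data standing in for the tweet dataframe
demo = pd.DataFrame({"target": [0, 1, 0, 1], "latitude": np.nan})

# independent copies of each subset
neg = demo[demo["target"] == 0].copy()
pos = demo[demo["target"] == 1].copy()

# placeholder values, not real coordinates
neg["latitude"] = 29.5
pos["latitude"] = 30.0

# recombine and restore the original row order
combined = pd.concat([neg, pos]).sort_index()
print(combined)
```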

In [23]:
# splitting the dataframe into the two targets for easier mapping
# (.copy() avoids a SettingWithCopyWarning when coordinates are assigned later)
null_target = df[df["target"] == 0].copy()
null_target.head()

Unnamed: 0,index,is_retweet,likes,replies,retweets,text,tweet_url,type,target,latitude,longitude
0,0,0.0,1,0.0,0.0,praying for yall in texas! #hurricaneharvey,/RedSoxNation52/status/901232720713469952,hurricane,0,,
1,1,0.0,1,0.0,3.0,fyi #hurricaneharvey,/nataliereyy/status/901232720088637446,hurricane,0,,
2,2,0.0,1,0.0,0.0,my prayers goes to everyone in texas being affected by #hurricaneharvey please be safe and ok,/sothiachhoeum2/status/901232719312670720,hurricane,0,,
3,3,0.0,11,2.0,1.0,#hurricaneharvey is coming we are bunkering down trying to save battery life will more than likely lose power prayers for our state xo,/tayslade/status/901232707455389696,hurricane,0,,
4,4,0.0,0,0.0,0.0,jim cantore is wearing a modded out baseball helmet on #theweatherchannel right now #safetyfirst #hurricaneharvey,/RebeccaBennitt/status/901232706247417856,hurricane,0,,


In [24]:
pos_target = df[df["target"] == 1].copy()
pos_target.head()

Unnamed: 0,index,is_retweet,likes,replies,retweets,text,tweet_url,type,target,latitude,longitude
6,6,0.0,2,0.0,0.0,at live at #hurricaneharvey plus local help on the way rescued tiger cub check up heatwave brings out snakes fire danger hs football,/KathleenFOX5/status/901232696533196800,hurricane,1,,
172,192,0.0,0,0.0,0.0,my heart is broken after hearing this praying for everyone in rockport #hurricaneharvey,/MariaPerezTW/status/901232218831544325,hurricane,1,,
197,217,0.0,1,0.0,0.0,uscoastguard rescues 1 from offshore supply ship near port mansfield #rgv #rgvwx #hurricaneharvey,/BrownsvilleNews/status/901232121133617153,hurricane,1,,
213,233,0.0,12,0.0,7.0,that code blue texas am cc just sent me felt like all hope is lost fuck you #hurricaneharvey,/robdoubleA/status/901232069170397185,hurricane,1,,
289,310,0.0,0,0.0,0.0,maybe ted cruz will save us #hurricaneharvey houston texas,/R_Phillip/status/901231724436361218,hurricane,1,,


In [25]:
# applying random latitudes and longitudes to the null dataset
null_target["latitude"] = null_target["latitude"].apply(
    lambda x: round(np.random.uniform(total_area["lat"][0], total_area["lat"][1]), 6))

null_target["longitude"] = null_target["longitude"].apply(
    lambda x: round(np.random.uniform(total_area["long"][0], total_area["long"][1]), 6))

In [26]:
# checking for the new coords
null_target.head()

Unnamed: 0,index,is_retweet,likes,replies,retweets,text,tweet_url,type,target,latitude,longitude
0,0,0.0,1,0.0,0.0,praying for yall in texas! #hurricaneharvey,/RedSoxNation52/status/901232720713469952,hurricane,0,29.622699,-95.2434
1,1,0.0,1,0.0,3.0,fyi #hurricaneharvey,/nataliereyy/status/901232720088637446,hurricane,0,30.067612,-95.696869
2,2,0.0,1,0.0,0.0,my prayers goes to everyone in texas being affected by #hurricaneharvey please be safe and ok,/sothiachhoeum2/status/901232719312670720,hurricane,0,29.89872,-95.837471
3,3,0.0,11,2.0,1.0,#hurricaneharvey is coming we are bunkering down trying to save battery life will more than likely lose power prayers for our state xo,/tayslade/status/901232707455389696,hurricane,0,29.79576,-95.285158
4,4,0.0,0,0.0,0.0,jim cantore is wearing a modded out baseball helmet on #theweatherchannel right now #safetyfirst #hurricaneharvey,/RebeccaBennitt/status/901232706247417856,hurricane,0,29.45396,-94.826653


In [27]:
# splitting the pos_target df into the different zones
z1, z2, z3, z4, z5, z6 = np.split(pos_target, 6)
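
`np.split` requires the rows to divide evenly, which holds here because 2202 positive tweets split into 6 groups of 367. If the count were not a multiple of six, `np.array_split` would be one alternative, since it allows groups of unequal size. A small sketch on a plain array (a stand-in for row positions, not the real data):

```python
import numpy as np

# stand-in for 10 row positions; 10 is not divisible by 6
rows = np.arange(10)

# np.split(rows, 6) would raise a ValueError here
chunks = np.array_split(rows, 6)
sizes = [len(c) for c in chunks]
print(sizes)
```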

To make setting the disaster zones easier, we define a function that takes two lists: the dataframes and the matching coordinate dictionaries. The function fills each dataframe's coordinate columns in place with random values drawn from its zone.

In [28]:
# making a function to apply the random coordinates to each df
def set_coord(df_list, coords_list):
    """Fill latitude/longitude in place with uniform random values from each zone."""
    
    # pairing each dataframe with its coordinate zone
    for zone_df, coords in zip(df_list, coords_list):
        
        # applying the random lat from the specified zone
        zone_df["latitude"] = zone_df["latitude"].apply(
            lambda x: round(np.random.uniform(coords["lat"][0], coords["lat"][1]), 6))
        
        # applying the random long from the specified zone
        zone_df["longitude"] = zone_df["longitude"].apply(
            lambda x: round(np.random.uniform(coords["long"][0], coords["long"][1]), 6))

In [29]:
# creating a list of all the target zones
target_zones = [z1, z2, z3, z4, z5, z6]

# creating a list of all the coord zones
coord_zones = [zone_1, zone_2, zone_3, zone_4, zone_5, total_area]

In [30]:
# running the function to assign the coordinates
set_coord(target_zones, coord_zones)

In [31]:
# checking the dfs for the new coordinates
z1.head()

Unnamed: 0,index,is_retweet,likes,replies,retweets,text,tweet_url,type,target,latitude,longitude
6,6,0.0,2,0.0,0.0,at live at #hurricaneharvey plus local help on the way rescued tiger cub check up heatwave brings out snakes fire danger hs football,/KathleenFOX5/status/901232696533196800,hurricane,1,29.962051,-96.125356
172,192,0.0,0,0.0,0.0,my heart is broken after hearing this praying for everyone in rockport #hurricaneharvey,/MariaPerezTW/status/901232218831544325,hurricane,1,29.814778,-96.102689
197,217,0.0,1,0.0,0.0,uscoastguard rescues 1 from offshore supply ship near port mansfield #rgv #rgvwx #hurricaneharvey,/BrownsvilleNews/status/901232121133617153,hurricane,1,29.796093,-96.09514
213,233,0.0,12,0.0,7.0,that code blue texas am cc just sent me felt like all hope is lost fuck you #hurricaneharvey,/robdoubleA/status/901232069170397185,hurricane,1,29.930451,-96.110322
289,310,0.0,0,0.0,0.0,maybe ted cruz will save us #hurricaneharvey houston texas,/R_Phillip/status/901231724436361218,hurricane,1,29.779674,-96.093268


We now have seven dataframes (the null targets plus six zone groups) and need to bring them back into one. They all still have the same columns, so they can simply be concatenated back together.
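
After recombining, a quick sanity check can confirm that no rows were lost and that every coordinate lies inside the overall bounding box. The sketch below uses made-up toy frames; only the latitude range is taken from `total_area`.

```python
import pandas as pd

# toy stand-ins for the split dataframes (values are made up)
part_a = pd.DataFrame({"target": [0, 0], "latitude": [29.5, 29.9]})
part_b = pd.DataFrame({"target": [1], "latitude": [30.0]})
bounds = {"lat": [29.333485, 30.105670]}  # latitude range of total_area

merged = pd.concat([part_a, part_b])

# the row count is preserved and every latitude falls inside the box
rows_ok = len(merged) == len(part_a) + len(part_b)
lat_ok = merged["latitude"].between(bounds["lat"][0], bounds["lat"][1]).all()
print(rows_ok, lat_ok)
```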

In [32]:
# concat-ing all the dfs
df = pd.concat([null_target, z1, z2, z3, z4, z5, z6])

In [33]:
# checking for the right target split
df["target"].value_counts()

0    20694
1    2202 
Name: target, dtype: int64

In [34]:
df = df[df['text'] != '']

Some tweets ended up empty after the regex cleaning (for example, a tweet that contained only a link), so we removed them to avoid NaNs when the exported CSV is read back in.
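
The reason empty strings matter can be shown with a small round trip through CSV. This is a sketch on a made-up two-row frame, written to an in-memory buffer instead of a file; with its default settings, `pd.read_csv` treats an empty field as missing.

```python
import io
import pandas as pd

# made-up frame with one empty tweet
demo = pd.DataFrame({"text": ["help us", ""], "target": [1, 0]})

# write to an in-memory CSV and read it back
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# the empty string comes back as NaN
print(round_trip["text"].isna().tolist())
```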

## Export Clean Data

In [35]:
# Export cleaned data to datasets folder
df.to_csv('../datasets/clean_df.csv', index=False)