#### [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started/data)

In [1]:
import pandas as pd


Read in Data

In [2]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
train_df.head()

|   | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


## Basic Exploration

In [3]:
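# count() returns the number of non-null values in each column:
# nearly all rows (7552 of 7613) have a keyword,
# and roughly two thirds (5080 of 7613) have a location
data_count = train_df.count()
print("non-null count per column:\n{}".format(data_count))

# target is 1 for disaster tweets, so 1 - mean is the share of non-disaster tweets
print("\nShare of non-disaster tweets (target == 0)")
print(1-train_df["target"].mean())

non-null count per column:
id          7613
keyword     7552
location    5080
text        7613
target      7613
dtype: int64

Share of non-disaster tweets (target == 0)
0.5703402075397347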
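
The same checks can be expressed as proportions rather than raw counts, which makes column coverage and class balance easier to compare. The sketch below is only an illustration: `toy_df` is a small made-up frame standing in for `train_df`, so the values it prints are not competition results.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df (values are made up).
toy_df = pd.DataFrame({
    "keyword": ["fire", None, "flood", "fire"],
    "location": ["USA", None, None, "London"],
    "target": [1, 0, 1, 0],
})

# Fraction of non-null values per column (non-null count / number of rows).
coverage = toy_df.notna().mean()

# Share of each class; normalize=True converts counts into proportions.
class_balance = toy_df["target"].value_counts(normalize=True).sort_index()

print(coverage)
print(class_balance)
```

Applied to the real data, the same two expressions on `train_df` would give the keyword and location coverage and the split between disaster and non-disaster tweets directly.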


#### Keywords
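Based on the table below, certain keywords are much more strongly associated with the positive class (target = 1, a real disaster) than others. The "mean" column is the fraction of tweets with that keyword that are labeled as disasters.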

In [4]:
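# tweet count and share of disaster tweets (mean of target) per keyword;
# selecting "target" before agg avoids aggregating the text columns
tp_by_keyword_df = (train_df
                    .fillna(value={"keyword": ""})
                    .groupby(by="keyword")["target"]
                    .agg(["count", "mean"])
                    .sort_values("mean", ascending=False))
tp_by_keyword_df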

| keyword | count | mean |
|---|---|---|
| wreckage | 39 | 1.000000 |
| debris | 37 | 1.000000 |
| derailment | 39 | 1.000000 |
| outbreak | 40 | 0.975000 |
| oil%20spill | 38 | 0.973684 |
| ... | ... | ... |
| body%20bag | 33 | 0.030303 |
| blazing | 34 | 0.029412 |
| ruin | 37 | 0.027027 |
| body%20bags | 41 | 0.024390 |
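
Several keywords in the table are percent-encoded ("oil%20spill", "body%20bags"). A minimal sketch of decoding them before grouping, using the standard library's `urllib.parse.unquote`, is shown here. The helper name `decode_keyword` is hypothetical, and treating missing values as an empty string is an assumption chosen to mirror the `fillna` step in the cell above.

```python
from urllib.parse import unquote


def decode_keyword(keyword):
    """Percent-decode a keyword, e.g. 'oil%20spill' -> 'oil spill'."""
    # Assumption: missing keywords (None or NaN) become an empty string,
    # mirroring the fillna(value={"keyword": ""}) step used for grouping.
    if not isinstance(keyword, str):
        return ""
    return unquote(keyword)


# Sample values taken from the keyword table, plus one missing value.
decoded = [decode_keyword(k) for k in ["oil%20spill", "body%20bags", "wreckage", None]]
print(decoded)
```

On the notebook's data this could be applied with something like `train_df["keyword"].map(decode_keyword)` before the `groupby`; whether near-duplicates such as "body bag" and "body bags" should also be merged is a separate modeling choice.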


#### Location

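This field is much sparser than "keyword": only about two thirds of rows have a location, and the values are free text (for example, "USA" and "United States" appear as separate entries in the table below). To reduce sparsity, we could use a model to map individual cities to a larger geographic area.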

In [5]:
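# tweet count and share of disaster tweets (mean of target) per location;
# rows with a missing location are dropped by groupby
tp_by_location_df = (train_df
                     .groupby(by="location")["target"]
                     .agg(["count", "mean"])
                     .sort_values("count", ascending=False))
tp_by_location_df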

| location | count | mean |
|---|---|---|
| USA | 104 | 0.644231 |
| New York | 71 | 0.225352 |
| United States | 50 | 0.540000 |
| London | 45 | 0.355556 |
| Canada | 29 | 0.448276 |
| ... | ... | ... |
| Hueco Mundo | 1 | 0.000000 |
| Hughes, AR | 1 | 1.000000 |
| Huntington, WV | 1 | 0.000000 |
| Huntley, IL | 1 | 0.000000 |

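As a rough illustration of grouping locations into larger areas, the sketch below uses a hand-written lookup table rather than a model. The alias table, the helper name `normalize_location`, and the "Other" and "Unknown" buckets are all hypothetical and not derived from the dataset; a real mapping would need a geocoder or a much larger gazetteer.

```python
# Hypothetical alias table: lower-cased raw strings -> coarser region.
# The entries are illustrative only.
LOCATION_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "new york": "United States",
    "london": "United Kingdom",
    "canada": "Canada",
}


def normalize_location(raw, aliases=LOCATION_ALIASES, default="Other"):
    """Map a free-text location to a coarse region via a lookup table."""
    # Missing or blank values get their own bucket instead of being dropped.
    if not isinstance(raw, str) or not raw.strip():
        return "Unknown"
    # Case- and whitespace-insensitive lookup; unmatched strings fall back
    # to the default bucket.
    return aliases.get(raw.strip().lower(), default)


for raw in ["USA", " united states ", "Hueco Mundo", None]:
    print(repr(raw), "->", normalize_location(raw))
```

With a mapping like this, "USA" and "United States" would fall into one group, and one-off values such as "Hueco Mundo" would be pooled instead of each forming a single-row group.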