# VAST Mini Challenge 03

This notebook performs an exploratory analysis of the data from the _VAST Mini Challenge 03_, using data visualization to answer the three questions the challenge proposes.

The .csv and .geojson files of the Abila map were obtained by converting the files provided by the mini challenge (in the Abila folder) with the tool https://mapshaper.org. To reproduce this, open the site, upload the Abila folder with its 4 files, and export as .csv and .geojson.

The resulting geojson was then processed with the code in the **geocode.js** file, which loads the geojson file of the Abila map and the csv file of reports, iterates over the locations of the reports, and obtains the coordinates of each one. The output is a .json file with the coordinates, called **coordinates.json**.

All these generated files are in the folder _abilaMap_processed_. 

**Attention**: running the js code that generates these files takes a few steps. We skip those steps here and use the files already generated.

In [2]:
import pandas as pd
import spacy as sp
from datetime import datetime
from nltk.corpus import stopwords
# import plotly.express as px
# import altair as alt

### Abila Map: Locations' Coordinates

## Preprocessing
The preprocessing includes punctuation and stopword removal and lemmatization. After that, calls ('ccdata' type) and posts ('mbdata' type) are separated into different files.
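
As a rough illustration of the filtering part of this step, the sketch below drops punctuation and stop words from a single message using only the standard library. The tiny `STOP_WORDS` set, the `clean_message` helper, and the sample sentence are assumptions made for this example. The notebook itself relies on NLTK's English stop word list and on spaCy for tokenization and lemmatization, which this sketch does not attempt to reproduce.

```python
import string

# Hypothetical, deliberately tiny stop word set; the notebook uses
# nltk.corpus.stopwords.words("english") instead.
STOP_WORDS = {"a", "an", "the", "to", "in", "is", "of", "and", "at"}


def clean_message(message: str) -> str:
    """Remove punctuation and stop words from a message (no lemmatization)."""
    # Delete every ASCII punctuation character, then split on whitespace.
    without_punct = message.translate(str.maketrans("", "", string.punctuation))
    tokens = without_punct.split()
    # Compare in lowercase so that "The" is filtered as well as "the".
    kept = [token for token in tokens if token.lower() not in STOP_WORDS]
    return " ".join(kept)


cleaned = clean_message("POK rally to start in Abila City Park.")
print(cleaned)
```

Note that stripping punctuation character by character also removes `@` and `#`, so mentions and hashtags lose their markers. Whether that is acceptable depends on the questions being answered.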

In [None]:
# Loading the appropriate model
#
# The model must be downloaded before it can be loaded. Choose one (or both, for good measure):
#  - Bigger, slower, but more accurate: python -m spacy download en_core_web_trf
#  - Smaller, faster, but less accurate: python -m spacy download en_core_web_sm
nlp = sp.load("en_core_web_trf")

In [134]:
# Loading and preprocessing the data sets.
#
# Time periods:
#  - Period one: 1700-1830
#  - Period two: 1831-2000
#  - Period three: 2001-2131

periods = ("1700-1830", "1831-2000", "2001-2131")
data_frames = [pd.read_csv(f"original_csv/csv-{period}.csv") for period in periods]
# stop_words = stopwords.words("english")

# for index, data_frame in enumerate(data_frames):
#     print(f"[{index + 1}/{len(data_frames)}] Processing data-frames...")
#     data_frame_length = data_frame.shape[0]

#     for row in range(data_frame_length):
#         print(f" |_ [{row + 1}/{data_frame_length}] Processing row...", end="\r")

#         message = data_frame.loc[row, "message"]
#         document = nlp(message)
#         tokens = [
#             token.lemma_ for token in document
#             if token.text not in stop_words  # Remove stop words
#             and token.is_punct is False      # Remove punctuation
#         ]

#         data_frame.loc[row, "message"] = " ".join(tokens)

#     print(f" |_ [{data_frame_length}/{data_frame_length}] Processing completed.")

combined_csv = pd.concat(data_frames)
combined_csv.to_csv("processed/combined_csv.csv", index=False)

In [None]:
# Separate posts and calls (reports)
combined_csv = pd.read_csv("processed/combined_csv.csv")
combined_csv['timestamp'] = combined_csv['date(yyyyMMddHHmmss)'].apply(lambda t : datetime.strptime(str(t),'%Y%m%d%H%M%S').strftime("%Y-%m-%d %H:%M:%S"))
combined_csv = combined_csv.drop(['date(yyyyMMddHHmmss)'], axis='columns')

posts = combined_csv[combined_csv['type'] == 'mbdata']

posts.to_csv("processed/posts.csv", index=False)

reports = combined_csv[combined_csv['type'] == 'ccdata']

reports = reports.drop(['author', 'longitude', 'latitude'], axis='columns')\
    .rename(columns={' location' : 'location'})

reports.to_csv("processed/reports.csv", index=False)
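
The timestamp conversion and the split by type can also be written without a per-row `lambda`. The following is a minimal sketch on made-up rows that mimic the column layout of the combined file. The `sample`, `sample_posts` and `sample_reports` names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up rows that mimic the column layout of the combined file.
sample = pd.DataFrame({
    "type": ["mbdata", "ccdata", "mbdata"],
    "date(yyyyMMddHHmmss)": [20140123170000, 20140123170500, 20140123213400],
    "message": ["first post", "a call", "second post"],
})

# Parse the integer timestamps in one vectorized call, then format them
# as "YYYY-MM-DD HH:MM:SS" strings.
parsed = pd.to_datetime(
    sample["date(yyyyMMddHHmmss)"].astype(str), format="%Y%m%d%H%M%S"
)
sample["timestamp"] = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")
sample = sample.drop(columns=["date(yyyyMMddHHmmss)"])

# Split by record type; .copy() gives independent frames that can be
# modified later without affecting `sample`.
sample_posts = sample[sample["type"] == "mbdata"].copy()
sample_reports = sample[sample["type"] == "ccdata"].copy()

print(sample_posts["timestamp"].tolist())
```

Keeping the parsed values as datetimes (skipping the `strftime` step) would make later time-based grouping easier, at the cost of a different representation in the saved csv files.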

In summary, we have 4 files:

In [19]:
posts = pd.read_csv("processed/posts.csv") # tweets
reports = pd.read_csv("processed/reports.csv") # reports (emergency calls)
combined_csv = pd.read_csv("processed/combined_csv.csv") # csv containing the three periods (1700-1830, 1831-2000, 2001-2131)
coordinates = pd.read_json("abilaMap_processed/coordinates.json") # mapping from location to coordinates

In [20]:
combined_csv.info() # 147 non-null entries in latitude and longitude

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4063 entries, 0 to 4062
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   type                  4063 non-null   object 
 1   date(yyyyMMddHHmmss)  4063 non-null   int64  
 2   author                3872 non-null   object 
 3   message               4063 non-null   object 
 4   latitude              147 non-null    float64
 5   longitude             147 non-null    float64
 6    location             176 non-null    object 
dtypes: float64(2), int64(1), object(4)
memory usage: 222.3+ KB


In [21]:
for i in combined_csv.loc[~combined_csv.loc[:, " location"].isna()].index:
    new_coordinates = coordinates.loc[coordinates.loc[:, "location"] == combined_csv.loc[i, " location"]]
    
    try:
        new_longitude, new_latitude = new_coordinates.reset_index().iloc[0, -1]
        combined_csv.loc[i, "longitude"], combined_csv.loc[i, "latitude"] = new_longitude, new_latitude
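    except (IndexError, TypeError, ValueError):
        # No matching location, or a malformed coordinate pair: leave the row as it is
        continue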
        
combined_csv = combined_csv.drop(columns=' location')
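
The row-by-row lookup above could also be expressed with `Series.map`. The sketch below uses made-up data and assumes that the lookup table has a `location` column plus a `coordinates` column holding `[longitude, latitude]` pairs. That column name and the sample values are assumptions for illustration, not the confirmed layout of coordinates.json.

```python
import pandas as pd

# Made-up data; the "coordinates" column name and the [longitude, latitude]
# ordering are assumptions about coordinates.json.
lookup = pd.DataFrame({
    "location": ["A St / B Ave", "C St / D Ave"],
    "coordinates": [[24.85, 36.05], [24.90, 36.07]],
})
records = pd.DataFrame({
    "location": ["A St / B Ave", None, "Unknown Rd"],
    "longitude": [float("nan"), 24.88, float("nan")],
    "latitude": [float("nan"), 36.06, float("nan")],
})

# Build one Series per axis, indexed by location name. Duplicated locations
# keep their first entry, mirroring iloc[0] in the loop.
unique = lookup.drop_duplicates("location").set_index("location")
lon_by_location = unique["coordinates"].apply(lambda pair: pair[0])
lat_by_location = unique["coordinates"].apply(lambda pair: pair[1])

# Map each record's location to a coordinate (NaN when there is no match)
# and fall back to the value that was already present.
records["longitude"] = records["location"].map(lon_by_location).fillna(records["longitude"])
records["latitude"] = records["location"].map(lat_by_location).fillna(records["latitude"])

print(records)
```

As in the loop, a matched location overrides any coordinate already present, while unmatched rows keep their original values.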

In [22]:
combined_csv.info() # now, there are 319 latitude and longitude entries

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4063 entries, 0 to 4062
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   type                  4063 non-null   object 
 1   date(yyyyMMddHHmmss)  4063 non-null   int64  
 2   author                3872 non-null   object 
 3   message               4063 non-null   object 
 4   latitude              319 non-null    float64
 5   longitude             319 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 190.6+ KB


In [23]:
combined_csv

Unnamed: 0,type,date(yyyyMMddHHmmss),author,message,latitude,longitude
0,mbdata,20140123170000,POK,Follow us @POK-Kronos,,
1,mbdata,20140123170000,maha_Homeland,Don't miss a moment! Follow our live coverage...,,
2,mbdata,20140123170000,Viktor-E,Come join us in the Park! Music tonight at Abi...,,
3,mbdata,20140123170000,KronosStar,POK rally to start in Abila City Park. POK lea...,,
4,mbdata,20140123170000,AbilaPost,POK rally set to take place in Abila City Park...,,
...,...,...,...,...,...,...
4058,mbdata,20140123213310,plasticParts,RT @AbilaPost unknown explosion heard from the...,,
4059,mbdata,20140123213345,klingon4real,RT @CentralBulletin explosion heard at dancing...,,
4060,mbdata,20140123213400,lindyT,RT @KronosStar There has been an explosion fro...,,
4061,mbdata,20140123213400,dolls4sale,RT @redisrad What was that? #boom,,


### Spam trigger words
The words were collected from the website https://outfunnel.com/spam-trigger-words/ 

In [13]:
with open('spam_words.txt', encoding='UTF-8') as f:
    spam_words = f.readlines()
    
spam_words = [x.strip() for x in spam_words]
spam_words

['0%',
 'Access now',
 'Bargain',
 '0% risk',
 'Access for free',
 'Believe me',
 '777',
 'Act now',
 'Big bucks',
 '99%',
 'Act immediately',
 'Billing',
 '99.9%',
 'Action required',
 'Billing address',
 '100%',
 'Additional income',
 'Billionaire',
 '100% more',
 'Affordable deal',
 'Billion dollars',
 '100% satisfied',
 'Apply online',
 'Best offer',
 '$$$',
 'At no cost',
 'Bulk email',
 '4U',
 'Auto email removal',
 'Buy direct',
 'Call me',
 'Deal',
 'Earn',
 '$',
 'Call now',
 'Debt',
 'Follow',
 'Earn extra income',
 'Calling creditors',
 'Direct email',
 'Earn money',
 'Cancel at any time',
 'Discount',
 'Earn monthly',
 'Cannot be combined',
 'Do it now',
 'Eliminate bad credit',
 'Cards accepted',
 'Do it today',
 'Eliminate debt',
 'Cash-out',
 'Don’t delete',
 'Email marketing',
 'Cash bonus',
 'Don’t hesitate',
 'Exclusive deal',
 'Click here',
 'Double your cash',
 'Expire',
 'Congratulations',
 'Double your income',
 'Extra cash',
 'Fantastic deal',
 'Get it now',
 'Hi