### Copyright
Copyright (c) <2022>, <Regina Nockerts>
All rights reserved.

This source code is licensed under the BSD-style license found in the
LICENSE file in the root directory of this source tree. 


In [1]:
import pandas as pd
import numpy as np
import re
from nlpUtils import aardvark as aa 
import os.path

In [None]:
# For the full dataset
tweets_data = pd.read_csv(os.path.join('archiveData', 'tweets_caps_27b_04.csv'), header=0, index_col=0)

print(tweets_data.shape)
tweets_data.tail()

# Load the data

# Strategy for non-relevant rows in the main dataframe
I did not want to further complicate the Twitter search, so I gathered everything related to refugees in general. Now it's time to get rid of rows that are not about Afghan refugees: find rows that relate to other refugee situations AND do not mention Afghanistan, and remove them.

First, I need a list of safe words: words that will **keep** a tweet in the dataset by default (list made via the iterations of searching below)

SEND these rows to a keep_df.

Then, **remove** anything else that has one of the exclusion terms (list made via the iterations of searching below)

SEE what's left.
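
The two-pass strategy above can be sketched with boolean masks. This is a minimal illustration and not the notebook's actual implementation: the toy rows, the `safe_terms` and `excluded_terms` lists, and the helper `any_term_mask` are all assumptions made for the example.

```python
import re

import pandas as pd

# Toy stand-in for the scraped tweets (hypothetical rows).
df = pd.DataFrame({"ContentClean": [
    "Afghan refugees arrive at the airport",
    "Syrian refugees and Afghan refugees need housing",
    "Rohingya refugees face flooding",
    "Refugees deserve our support",
]})

safe_terms = ["afgh", "kabul"]          # assumed safe words
excluded_terms = ["syria", "rohingya"]  # assumed exclusion terms


def any_term_mask(text: pd.Series, terms: list[str]) -> pd.Series:
    """True where the text contains at least one term (case-insensitive substring)."""
    pattern = "|".join(re.escape(t) for t in terms)
    return text.str.contains(pattern, case=False, regex=True, na=False)


is_safe = any_term_mask(df["ContentClean"], safe_terms)
is_excluded = any_term_mask(df["ContentClean"], excluded_terms)

keep_df = df[is_safe]                    # kept by default
removed_df = df[~is_safe & is_excluded]  # other refugee situations AND not Afghanistan
left_df = df[~is_safe & ~is_excluded]    # what's left to inspect
```

Note that the second toy row mentions both Syria and Afghanistan and is kept, because the safe words take priority over the exclusion terms.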


## Find likely terms for exclusion
Start by looking for likely terms by reviewing a random subset of the full scrape. 

The below have been identified as likely:<br>
('arkham', 6), 
('gotham', 3), 
('batman', 3), 
('rohinga', 5), 
('rohingya', 466), 
('syrian', 2823), 
('syria', 1753), 
('tamil', 297), 
('tigray', 271), 
('sudan', 230), 
('sudanese', 112), 
('somalia', 255), 
('somali', 238), 
('congo', 72), 
('congalese', 1), 
('congolese', 59), 
('eritrea', 125), 
('eritrean', 479), 
('apo', 2054), 
('nigeria', 226), 
('nigerian', 146), 
('uganda', 444), 
('ugandan', 206), 
('rwamwanja', 3), 
('kirwa', 0), 
('biloela', 468), 
('iraq', 2178), 
('iraqi', 20083), 
('yemen', 462), 
('yemeni', 195), 
('rwanda', 1065), 
('rwandan', 88), 
('kenya', 212), 
('kenyan', 73), 
('kashmir', 1091), 
('kashmiri', 1850), 
('palestine', 331), 
('palestinian', 514), 
('haiti', 352), 
('haitian', 646), 
('nigerian', 110), 
('tesla', 2), 
('tsla', 1).

Be careful, but look at these, too:
('patience', 123), 
('continued patience', 0), 
('rental', 100), 
('tenant', 27), 
('tenants', 56), 
('buddhist', 25), 
('hindu', 584), 

('vietnam', 402)
('vietnamese', 653)
('cambodia', 66)
('cambodian', 52)
('indonesia', 5813), 
('indonesian', 280), 

('fire', 809), 
('flood', 440), 
('landslide', 49).

In [6]:
term_list = [
    #EXCLUSION
    "arkham", "gotham", "batman",
    "rohinga", "rohingya",
    "syrian", "syria",
    "tamil",
    "tigray",
    "sudan", "sudanese",
    "somalia", "somali",
    "congo", "congalese", "congolese",
    "eritrea", "eritrean",
    "apo",
    "nigeria", "nigerian",
    "uganda", "ugandan",
    "rwamwanja",
    "kirwa", 
    "biloela",
    "iraq", "iraqi",
    "yemen", "yemeni",
    "rwanda", "rwandan",
    "kenya", "kenyan",
    "kashmir", "kashmiri",
    "palestine", "palestinian",
    "haiti", "haitian",
    "tesla", "TSLA",
    "ukraine", "ukrainian",
    "your patience", "your continued patience",
    "rental", "tenant", "tenants",
    "buddhist", "hindu"]

# Second exclusion round
    # "Australia"
    # "vietnam", "vietnamese",
    # "cambodia", "cambodian",
    # "indonesia", "indonesian",
    # "fire", "flood", "landslide", "cyclone"]

results = []
for i in term_list:
    results.append(aa.term_check(i, tweets_data))
results

[('arkham', 1),
 ('gotham', 2),
 ('batman', 1),
 ('rohinga', 1),
 ('rohingya', 316),
 ('syrian', 1379),
 ('syria', 431),
 ('tamil', 280),
 ('tigray', 259),
 ('sudan', 115),
 ('sudanese', 71),
 ('somalia', 82),
 ('somali', 158),
 ('congo', 43),
 ('congalese', 1),
 ('congolese', 49),
 ('eritrea', 61),
 ('eritrean', 465),
 ('apo', 2054),
 ('nigeria', 157),
 ('nigerian', 127),
 ('uganda', 225),
 ('ugandan', 132),
 ('rwamwanja', 2),
 ('kirwa', 0),
 ('biloela', 446),
 ('iraq', 951),
 ('iraqi', 18548),
 ('yemen', 88),
 ('yemeni', 66),
 ('rwanda', 894),
 ('rwandan', 83),
 ('kenya', 180),
 ('kenyan', 66),
 ('kashmir', 1027),
 ('kashmiri', 1805),
 ('palestine', 256),
 ('palestinian', 453),
 ('haiti', 117),
 ('haitian', 251),
 ('tesla', 1),
 ('tsla', 1),
 ('ukraine', 1353),
 ('ukrainian', 2403),
 ('your patience', 0),
 ('your continued patience', 0),
 ('rental', 60),
 ('tenant', 20),
 ('tenants', 47),
 ('buddhist', 16),
 ('hindu', 328)]
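
The source of `aa.term_check` is not shown in this notebook. Judging only from its output (a lower-cased term paired with a row count), a rough stand-in might look like the sketch below. The function name `term_check_sketch`, the whole-word matching rule, and the toy data are assumptions, not the real implementation.

```python
import re

import pandas as pd


def term_check_sketch(term: str, df: pd.DataFrame, col: str = "ContentClean") -> tuple[str, int]:
    """Count rows whose text contains `term` as a whole word, ignoring case."""
    term = term.lower()
    pattern = rf"\b{re.escape(term)}\b"
    hits = df[col].str.contains(pattern, case=False, regex=True, na=False)
    return term, int(hits.sum())


# Hypothetical rows, including a missing value.
toy = pd.DataFrame({"ContentClean": [
    "Syrian families resettled",
    "News from Syria today",
    "TSLA stock is up",
    None,
]})

counts = [term_check_sketch(t, toy) for t in ["syria", "syrian", "TSLA"]]
```

With whole-word matching, "syria" and "syrian" are counted separately, which would be consistent with the output above, where the two variants have independent counts.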

FOR THE FULL DATASET

So, most of these are uncommon: fewer than 1,000 rows each (~0.3% of the 384,241 rows), which is good: my search terms are being pretty precise.

But some seem uncomfortably high: 
('syria/n', 1,753 + 2,823), 
('apo', 2,054), 
('iraq/i', 2,178 + 20,083), 
('rwanda/n', 1,065 + 88), 
('kashmir/i', 1,091 + 1,850), 
NOTE: Palestine and Ukraine should be here too, but the terms were identified later.


And some just need a closer look in general:
('patience', 123), 
('continued patience', 0), 
('rental', 100), 
('tenant', 27), 
('tenants', 56), 
('buddhist', 25), 
('hindu', 584), 
('vietnam/ese', 402 + 653)
('indonesia/n', 5,813 + 280), 
('fire', 809), 
('flood', 440), 
('landslide', 49).

FOR THE EVAL DATASET:

About half the data, so looking for terms with more than 500 rows...

 ('syrian', 1379),
 ('syria', 431),
 ('eritrea', 61),
 ('eritrean', 465),
 ('apo', 2054),
 ('iraq', 951),
 ('iraqi', 18548),
 ('rwanda', 894),
 ('rwandan', 83),
 ('kashmir', 1027),
 ('kashmiri', 1805),
 ('palestine', 256),
 ('palestinian', 453),
 ('ukraine', 1353),
 ('ukrainian', 2403),


### Examine non-Afghan rows
So. I need a better look at the potentially irrelevant rows. Let's make a subset of the df with all the rows with afgha* removed, and then see what is left.

In [8]:
# NOTE: This should pull from the cleanest possible .csv into a NEW df that we can flag and delete from
tweets_nonAfg = pd.read_csv('tweets_caps_27b_04.csv', header=0, index_col=0)
tweets_nonAfg.insert(loc=3, column='Flag', value="no")
print(tweets_nonAfg.shape)
tweets_nonAfg.tail()

(384241, 15)


Unnamed: 0,Date,Content,ContentClean,Flag,n_CapLetters,CapsRatio,AllCapWords,https,Mentions,Location,ReplyCount,RetweetCount,LikeCount,QuoteCount,Hashtags
384236,2021-01-01 01:55:12+00:00,@StayFierce1973 @Lala43463561 @Raufmustafaye10...,"it is not we who prove this, but the R...",no,3,0.023438,,,"@StayFierce1973, @Lala43463561, @Raufmustafaye...",Toronto - Baku,0,0,1,0,No hashtags
384237,2021-01-01 01:07:04+00:00,20201230: Bryony Lau: Canada now resettles mor...,20201230: Bryony Lau: Canada now resettles mor...,no,8,0.029412,"UN""",https://t.co/UzF1CVFfgV,,Toronto,0,0,0,0,No hashtags
384238,2021-01-01 00:43:07+00:00,"@joemcafield Yep, just spent 40 mins (and coun...","Yep, just spent 40 mins (and counting) tryin...",no,6,0.056075,🙄🤦‍♀️😁\nHNY,,@joemcafield,"Wakefield, England",1,0,0,0,No hashtags
384239,2021-01-01 00:10:26+00:00,@StayFierce1973 @Raufmustafaye10 @alliemark5 @...,excuse me? how do you explain the doc...,no,4,0.022989,,https://t.co/9NvC8mkf6r,"@StayFierce1973, @Raufmustafaye10, @alliemark5...",Toronto - Baku,1,0,0,0,No hashtags
384240,2021-01-01 00:06:57+00:00,@postcovid_CH @WHO @pahowho @WHOWPRO @WHOAFRO ...,"Also, there are many c...",no,7,0.023333,I,,"@postcovid_CH, @WHO, @pahowho, @WHOWPRO, @WHOA...",English-speaking,0,0,3,0,No hashtags


I checked https://www.thefreedictionary.com/words-containing-afgh - it looks like there are no words containing "afgh" that we need to worry about; we can just kick out these rows.

Same with "kabul"

In [9]:
# NOTE: This function only flags positive instances; it may be run multiple times with different terms
# NOTE: This function starts with an input: to reset the index
aa.flag_term("afg", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
aa.flag_term("kabul", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag", indx_warning=False)

tweets_nonAfg.tail()

yes    227112
no     157129
Name: Flag, dtype: int64
yes    227728
no     156513
Name: Flag, dtype: int64


Unnamed: 0,Date,Content,ContentClean,Flag,n_CapLetters,CapsRatio,AllCapWords,https,Mentions,Location,ReplyCount,RetweetCount,LikeCount,QuoteCount,Hashtags
384236,2021-01-01 01:55:12+00:00,@StayFierce1973 @Lala43463561 @Raufmustafaye10...,"it is not we who prove this, but the R...",no,3,0.023438,,,"@StayFierce1973, @Lala43463561, @Raufmustafaye...",Toronto - Baku,0,0,1,0,No hashtags
384237,2021-01-01 01:07:04+00:00,20201230: Bryony Lau: Canada now resettles mor...,20201230: Bryony Lau: Canada now resettles mor...,no,8,0.029412,"UN""",https://t.co/UzF1CVFfgV,,Toronto,0,0,0,0,No hashtags
384238,2021-01-01 00:43:07+00:00,"@joemcafield Yep, just spent 40 mins (and coun...","Yep, just spent 40 mins (and counting) tryin...",no,6,0.056075,🙄🤦‍♀️😁\nHNY,,@joemcafield,"Wakefield, England",1,0,0,0,No hashtags
384239,2021-01-01 00:10:26+00:00,@StayFierce1973 @Raufmustafaye10 @alliemark5 @...,excuse me? how do you explain the doc...,no,4,0.022989,,https://t.co/9NvC8mkf6r,"@StayFierce1973, @Raufmustafaye10, @alliemark5...",Toronto - Baku,1,0,0,0,No hashtags
384240,2021-01-01 00:06:57+00:00,@postcovid_CH @WHO @pahowho @WHOWPRO @WHOAFRO ...,"Also, there are many c...",yes,7,0.023333,I,,"@postcovid_CH, @WHO, @pahowho, @WHOWPRO, @WHOA...",English-speaking,0,0,3,0,No hashtags
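
The comments in the cell above say that `aa.flag_term` only flags positive instances and can be run several times. A minimal sketch of that behavior follows. It assumes a case-insensitive substring match; `flag_term_sketch` is a hypothetical stand-in, and it leaves out the index-reset prompt that the real function apparently opens with.

```python
import pandas as pd


def flag_term_sketch(term: str, df: pd.DataFrame,
                     clean_col: str = "ContentClean", flag_col: str = "Flag") -> pd.Series:
    """Set flag_col to "yes" where clean_col contains `term`; never resets existing flags."""
    hits = df[clean_col].str.contains(term, case=False, regex=False, na=False)
    df.loc[hits, flag_col] = "yes"
    return df[flag_col].value_counts()


# Hypothetical rows, all starting unflagged.
toy = pd.DataFrame({
    "ContentClean": ["Afghan evacuees land", "Flights leave Kabul", "Unrelated tweet"],
    "Flag": "no",
})

flag_term_sketch("afg", toy)
flag_term_sketch("kabul", toy)  # flags accumulate across calls
```

Because a call never writes "no", the "yes" count can only grow from one call to the next, as it does in the two value counts printed above.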


### PAUSE / UNPAUSE

In [83]:
# PAUSE
tweets_nonAfg.to_csv(os.path.join('archiveData', "tweet_flagged_afg_27b_04.csv"))

In [None]:
# Unpause

# From this notebook's process, this is flagged but not separated (see below):
tweets_nonAfg = pd.read_csv(os.path.join('archiveData', "tweet_flagged_afg_27b_04.csv"), header=0, index_col=0)

# Or from the keep rows of the dataCleaning notebook, separated:
# tweets_nonAfg = pd.read_csv(os.path.join('archiveData', 'tweets_eval_temp.csv'), header=0, index_col=0)


In [None]:
tweets_nonAfg.head()

### Separate the df by flag

In [10]:
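# Split by flag: rows flagged "yes" (Afghanistan-related) go to tweets_keep;
# rows still flagged "no" stay in tweets_nonAfg for further evaluation.
# Run this cell only ONCE: the second line overwrites tweets_nonAfg.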
tweets_keep = tweets_nonAfg[tweets_nonAfg["Flag"] == 'yes'].copy() 
tweets_nonAfg = tweets_nonAfg[tweets_nonAfg["Flag"] == 'no'].copy()

# And reset the flag column
tweets_keep.Flag = "no"
tweets_nonAfg.Flag = "no"

In [14]:
print("keep:", tweets_keep.shape)  # 227,728
print("eval:", tweets_nonAfg.shape)
tweets_nonAfg.tail()

keep: (227728, 15)
eval: (156513, 15)


Unnamed: 0,Date,Content,ContentClean,Flag,n_CapLetters,CapsRatio,AllCapWords,https,Mentions,Location,ReplyCount,RetweetCount,LikeCount,QuoteCount,Hashtags
384235,2021-01-01 02:55:56+00:00,Gee's Bend was a part of Federal Government's ...,Gee's Bend was a part of Federal Government's ...,no,9,0.07563,,"https://t.co/caOdAb9xOq, https://t.co/UJfSC4S59p",,Alabama,0,1,1,0,No hashtags
384236,2021-01-01 01:55:12+00:00,@StayFierce1973 @Lala43463561 @Raufmustafaye10...,"it is not we who prove this, but the R...",no,3,0.023438,,,"@StayFierce1973, @Lala43463561, @Raufmustafaye...",Toronto - Baku,0,0,1,0,No hashtags
384237,2021-01-01 01:07:04+00:00,20201230: Bryony Lau: Canada now resettles mor...,20201230: Bryony Lau: Canada now resettles mor...,no,8,0.029412,"UN""",https://t.co/UzF1CVFfgV,,Toronto,0,0,0,0,No hashtags
384238,2021-01-01 00:43:07+00:00,"@joemcafield Yep, just spent 40 mins (and coun...","Yep, just spent 40 mins (and counting) tryin...",no,6,0.056075,🙄🤦‍♀️😁\nHNY,,@joemcafield,"Wakefield, England",1,0,0,0,No hashtags
384239,2021-01-01 00:10:26+00:00,@StayFierce1973 @Raufmustafaye10 @alliemark5 @...,excuse me? how do you explain the doc...,no,4,0.022989,,https://t.co/9NvC8mkf6r,"@StayFierce1973, @Raufmustafaye10, @alliemark5...",Toronto - Baku,1,0,0,0,No hashtags
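
The split above has to be run exactly once because it overwrites its own input. One possible way around that, sketched here on toy data, is to compute the mask once and derive both halves from a frame that is never reassigned. The names `keep_part` and `eval_part` are hypothetical.

```python
import pandas as pd

# Hypothetical flagged rows.
toy = pd.DataFrame({
    "ContentClean": ["Afghan evacuees land", "Flights leave Kabul", "Unrelated tweet"],
    "Flag": ["yes", "yes", "no"],
})

# Compute the mask once from an untouched source frame, then derive both
# halves from it. Re-running this gives the same result because `toy`
# itself is never overwritten.
is_flagged = toy["Flag"].eq("yes")
keep_part = toy[is_flagged].assign(Flag="no")   # flag reset on the copy only
eval_part = toy[~is_flagged].assign(Flag="no")

# The two parts should partition the source exactly.
assert len(keep_part) + len(eval_part) == len(toy)
```

The cost is that the unsplit frame stays in memory alongside the two halves, which may matter for a dataset of this size.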


Take a quick peek at the kept rows:

In [12]:
print(aa.term_check("afghanistan", tweets_keep))
print(aa.term_check("afghanistan", tweets_nonAfg))

('afghanistan', 40654)
('afghanistan', 0)


In [92]:
peek = aa.subset_gen(tweets_keep, 25)
aa.labeler(peek, col="ContentClean", lab="ContentLabel", verby=False)

a dataframe and temp_subset_gen.csv of length 25 have been created
To end the session, enter 'ESC'


y    23
n     1
Name: ContentLabel, dtype: int64

Great. Overwhelmingly relevant.

### Pause/Unpause

In [89]:
# PAUSE
tweets_keep.to_csv(os.path.join('archiveData', "tweet_flagged_keep_27b_04.csv"))
tweets_nonAfg.to_csv(os.path.join('archiveData', "tweet_flagged_nonAfg_27b_04.csv"))

In [14]:
# UNPAUSE
tweets_keep = pd.read_csv(os.path.join('archiveData', "tweet_flagged_keep_27b_04.csv"), header=0, index_col=0)
tweets_nonAfg = pd.read_csv(os.path.join('archiveData', "tweet_flagged_nonAfg_27b_04.csv"), header=0, index_col=0)

print(tweets_keep.shape)
print(tweets_nonAfg.shape)
tweets_nonAfg.tail()

(227728, 15)
(156513, 15)


Unnamed: 0,Date,Content,ContentClean,Flag,n_CapLetters,CapsRatio,AllCapWords,https,Mentions,Location,ReplyCount,RetweetCount,LikeCount,QuoteCount,Hashtags
384235,2021-01-01 02:55:56+00:00,Gee's Bend was a part of Federal Government's ...,Gee's Bend was a part of Federal Government's ...,,9,0.07563,,"https://t.co/caOdAb9xOq, https://t.co/UJfSC4S59p",,Alabama,0,1,1,0,No hashtags
384236,2021-01-01 01:55:12+00:00,@StayFierce1973 @Lala43463561 @Raufmustafaye10...,"it is not we who prove this, but the R...",,3,0.023438,,,"@StayFierce1973, @Lala43463561, @Raufmustafaye...",Toronto - Baku,0,0,1,0,No hashtags
384237,2021-01-01 01:07:04+00:00,20201230: Bryony Lau: Canada now resettles mor...,20201230: Bryony Lau: Canada now resettles mor...,,8,0.029412,"UN""",https://t.co/UzF1CVFfgV,,Toronto,0,0,0,0,No hashtags
384238,2021-01-01 00:43:07+00:00,"@joemcafield Yep, just spent 40 mins (and coun...","Yep, just spent 40 mins (and counting) tryin...",,6,0.056075,🙄🤦‍♀️😁\nHNY,,@joemcafield,"Wakefield, England",1,0,0,0,No hashtags
384239,2021-01-01 00:10:26+00:00,@StayFierce1973 @Raufmustafaye10 @alliemark5 @...,excuse me? how do you explain the doc...,,4,0.022989,,https://t.co/9NvC8mkf6r,"@StayFierce1973, @Raufmustafaye10, @alliemark5...",Toronto - Baku,1,0,0,0,No hashtags
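
In the reloaded table above, the Flag column comes back empty. One possible explanation, offered as an assumption, is that the column held empty strings when it was written: by default pandas writes an empty string as an empty CSV field and reads it back as NaN. The sketch below reproduces that effect in memory with toy data.

```python
import io

import pandas as pd

# Hypothetical rows: one flag was reset to an empty string before saving.
toy = pd.DataFrame({"ContentClean": ["a tweet", "another tweet"], "Flag": ["", "no"]})

# Round-trip through CSV in memory instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, header=0, index_col=0)

# The empty string was written as an empty field and is read back as NaN.
n_missing = int(reloaded["Flag"].isna().sum())

# One way to restore a usable flag column after loading.
reloaded["Flag"] = reloaded["Flag"].fillna("no")
```

Resetting the flag explicitly after every load, as the next cell does with `tweets_nonAfg.Flag = "no"`, has the same effect.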


### More Kabul?
There are still 190 tweets that contain "kab". This is a partial match, so it can't be done through the term checker; flag the rows instead and look at the flag value_counts. I want to look at those.

In [19]:
tweets_nonAfg.Flag = "no"

## This function starts with an input: index box
aa.flag_term("kab", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
print("df shape:", tweets_nonAfg.shape)
print("rows with term", tweets_nonAfg.Flag.value_counts())
#tweets_keep.tail()

no     156323
yes       190
Name: Flag, dtype: int64
df shape: (156513, 15)
rows with term no     156323
yes       190
Name: Flag, dtype: int64


In [21]:
# BUILD a df with just those tweets so we can look at them.

temp_df = tweets_nonAfg[tweets_nonAfg["Flag"] == 'yes'].copy()
print(temp_df.shape)
temp_df.tail()

Unnamed: 0,Date,Content,ContentClean,Flag,n_CapLetters,CapsRatio,AllCapWords,https,Mentions,Location,ReplyCount,RetweetCount,LikeCount,QuoteCount,Hashtags
151569,2021-01-18 21:27:30+00:00,"So honored to work with the Cases, the Huckabe...","So honored to work with the Cases, the Huckabe...",yes,10,0.045249,,"https://t.co/Ta7MTO0dll, https://t.co/nIfbrrWTa0",@arktimes,English-speaking,0,0,0,0,No hashtags
151723,2021-01-18 16:39:49+00:00,Post Edited: Former Gov. Mike Huckabee to rese...,Post Edited: Former Gov. Mike Huckabee to rese...,yes,8,0.083333,,"https://t.co/Lo5AjoAfoD, https://t.co/mTqqY3a9ib",,HeartOfAmerica,0,0,0,0,No hashtags
151726,2021-01-18 16:31:28+00:00,Former Gov. Mike Huckabee to resettle on five-...,Former Gov. Mike Huckabee to resettle on five-...,yes,6,0.068966,,https://t.co/vDvQrAGNWM,@arktimes,maxbrantley@arktimes.com,12,1,7,4,No hashtags
153115,2021-01-13 17:01:41+00:00,It will be recalled that Akufo-Addo in 2019 he...,It will be recalled that Akufo-Addo in 2019 he...,yes,12,0.043321,,,,"Lagos, Nigeria",1,0,0,0,No hashtags
153268,2021-01-13 11:45:55+00:00,Two locations to invest in Abuja for near futu...,Two locations to invest in Abuja for near futu...,yes,4,0.029197,,,,Abuja -Nigeria,0,0,2,0,No hashtags


In [None]:
temp_subset = aa.subset_gen(temp_df, 20, seed=1080)

# NOTE: This function starts with an input: to reset the index
aa.labeler(temp_subset, col="ContentClean", lab="ContentLabel")

The "kab" tweets are a mix of relevant and irrelevant: 11 of the 20 checked were unrelated (I stopped at that point because the trend was clear), so these rows cannot simply be kicked out, or kept, as a group.
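
The mixed result is what a partial match tends to produce. As a small illustration using only the standard library: the first sample below is adapted from row 151726 in the table above, while the second is made up for contrast.

```python
import re

samples = [
    "Former Gov. Mike Huckabee to resettle",   # contains "kab" inside a surname
    "Evacuation flights leave Kabul tonight",  # the kind of match actually wanted
]

# Substring test: true wherever the letters "kab" appear, even mid-word.
substring_hits = ["kab" in s.lower() for s in samples]

# Whole-word test for the full place name.
word_hits = [bool(re.search(r"\bkabul\b", s, flags=re.IGNORECASE)) for s in samples]
```

The short stem catches alternate spellings of the city name but also pulls in unrelated words, which is the trade-off seen in the labeled sample.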

## Looking at the non-Afg rows for inclusion terms
So, let's move on. Take a look at a subset of the df in the labeler and see what we see...

In [8]:
temp_subset = aa.subset_gen(tweets_nonAfg, 50, seed=1080)

# NOTE: This function starts with an input: to reset the index
aa.labeler(temp_subset, col="ContentClean", lab="ContentLabel")

a dataframe and temp_subset_gen.csv of length 50 have been created
To end the session, enter 'ESC'


n      32
unk     9
y       8
Name: ContentLabel, dtype: int64

Of the 50 tweets sampled, 49 were labeled: only 8 were relevant, 32 were not, and the remaining 9 were unclear/unknowable.

Check the following terms for inclusion criteria:
* "airport" -> 209 instances
* "interpreter", "interpreters" -> 50 and 235 instances
* "evacuation" -> 3,295 instances
* "allies" -> 400 instances
* "parole" -> 1,502 instances
* "SIV" ("special immigrant visa") -> 1,724

Many tweets had to do with Iraqi asylum seekers / refugees. This is an easy delete from the main df.

But the rest were very scattered with only a few on the same topic.

In [18]:
print(aa.term_check("SIV", tweets_nonAfg))

('siv', 1724)


In [None]:
# RESET ONLY
tweets_nonAfg["Flag"] = ""
tweets_nonAfg.tail()

In [31]:
aa.flag_term("airport", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
aa.flag_term("interpreter", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
aa.flag_term("translator", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
aa.flag_term("allies", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
aa.flag_term(" siv ", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
aa.flag_term(" sivs ", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
aa.flag_term(" siv's ", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
aa.flag_term("special immigrant visa", tweets_nonAfg, clean_col="ContentClean", flag_col="Flag")
tweets_nonAfg.tail()

Row: 60000


Unnamed: 0,Date,Content,ContentClean,Flag,n_CapLetters,CapsRatio,AllCapWords,https,Mentions,Location,ReplyCount,RetweetCount,LikeCount,QuoteCount,Hashtags
156508,2021-01-01 02:55:56+00:00,Gee's Bend was a part of Federal Government's ...,Gee's Bend was a part of Federal Government's ...,,9,0.07563,,"https://t.co/caOdAb9xOq, https://t.co/UJfSC4S59p",,Alabama,0,1,1,0,No hashtags
156509,2021-01-01 01:55:12+00:00,@StayFierce1973 @Lala43463561 @Raufmustafaye10...,"it is not we who prove this, but the R...",,3,0.023438,,,"@StayFierce1973, @Lala43463561, @Raufmustafaye...",Toronto - Baku,0,0,1,0,No hashtags
156510,2021-01-01 01:07:04+00:00,20201230: Bryony Lau: Canada now resettles mor...,20201230: Bryony Lau: Canada now resettles mor...,,8,0.029412,"UN""",https://t.co/UzF1CVFfgV,,Toronto,0,0,0,0,No hashtags
156511,2021-01-01 00:43:07+00:00,"@joemcafield Yep, just spent 40 mins (and coun...","Yep, just spent 40 mins (and counting) tryin...",,6,0.056075,🙄🤦‍♀️😁\nHNY,,@joemcafield,"Wakefield, England",1,0,0,0,No hashtags
156512,2021-01-01 00:10:26+00:00,@StayFierce1973 @Raufmustafaye10 @alliemark5 @...,excuse me? how do you explain the doc...,,4,0.022989,,https://t.co/9NvC8mkf6r,"@StayFierce1973, @Raufmustafaye10, @alliemark5...",Toronto - Baku,1,0,0,0,No hashtags


In [32]:
tweets_nonAfg.Flag.value_counts()

       153009
yes      3504
Name: Flag, dtype: int64

In [33]:
new_flags = tweets_nonAfg[tweets_nonAfg["Flag"] == 'yes'].copy()
print(new_flags.shape)
new_flags.tail()

(3504, 15)


Unnamed: 0,Date,Content,ContentClean,Flag,n_CapLetters,CapsRatio,AllCapWords,https,Mentions,Location,ReplyCount,RetweetCount,LikeCount,QuoteCount,Hashtags
153515,2021-01-12 03:38:44+00:00,Chinese-owned firm Siem Reap-Angkor Internatio...,Chinese-owned firm Siem Reap-Angkor Internatio...,yes,12,0.047059,,https://t.co/nUx1Qx4INi,,English-speaking,0,1,1,0,No hashtags
153956,2021-01-10 10:06:29+00:00,@dw2essex @Hammer_doc @butlerrichard2 @VeuveK ...,...,yes,15,0.04298,"NOT, REAL",https://t.co/10C8Z1HptJ,"@dw2essex, @Hammer_doc, @butlerrichard2, @Veuv...","North West, England",1,0,0,0,No hashtags
154902,2021-01-07 08:12:08+00:00,"This mrg 7/01, the #Mai Mai &amp; Allies attac...","This mrg 7/01, the #Mai Mai &amp; Allies attac...",yes,14,0.054902,#FARDC,,"@fatshi13, @MonuscoS, @RFIAfrique.",Kinshasa,0,0,3,0,"['Mai', 'Kangwe', 'Kamombo', 'FARDC']"
155217,2021-01-06 00:54:56+00:00,@jypersian @evanishistory There was pressure o...,There was pressure on all participating Al...,yes,7,0.02583,,,"@jypersian, @evanishistory, @AdamSeipp","Melbourne, Victoria",1,0,1,0,No hashtags
155899,2021-01-03 18:05:31+00:00,@ScottEshom @n1leftbehind @michaelgwaltz An in...,An interpreter shouldn't have to have th...,yes,4,0.022472,SIV,,"@ScottEshom, @n1leftbehind, @michaelgwaltz","Dallas, TX",0,0,1,0,No hashtags


In [34]:
temp_subset = aa.subset_gen(new_flags, 50, seed=1080)

# NOTE: This function starts with an input: to reset the index
aa.labeler(temp_subset, col="ContentClean", lab="ContentLabel")

a dataframe and temp_subset_gen.csv of length 50 have been created
To end the session, enter 'ESC'


y                40
n - airport       3
n                 2
n - ?             2
n - translate     1
n - allies        1
unk               1
Name: ContentLabel, dtype: int64

So, this had mixed results. For 50 tweets and terms ("airport", "interpreter", "evacuation", "allies", "humanitarian parole", " siv ", "special immigrant visa"):

y                 20 <br>
n - evac          11 <br>
unk               11 <br>
n - hum parole     5 <br>
n - airport        2 <br>
n - ?              1 <br>

So, "evacuation" and "humanitarian parole" are not good safewords. The rest seem ok.

Try this again with a few different terms ("airport", "interpreter", "translator", "allies", " siv ", " sivs ", " siv's ", "special immigrant visa"):

y                40 <br>
n - airport       3<br>
n                 2<br>
n - ?             2<br>
n - translate     1<br>
n - allies        1<br>
unk               1<br>

That's better, but not great. The variations on "SIV" work well. But I think I can take airport out. Also, maybe better to delete some rows first... But all this is over only 3,504 rows. Which hardly makes a dent.
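
The space-padded terms (" siv ", " sivs ", " siv's ") avoid matching "siv" inside longer words, but they can also miss the term at the start or end of a tweet or next to punctuation. A regular expression with word boundaries is one possible alternative. The sketch below uses made-up sample texts and assumes plain substring matching for the padded term, which may differ from what `aa.flag_term` does.

```python
import re

# Hypothetical tweets; only the first would match the padded term " siv ".
samples = [
    "my siv application is pending",
    "SIV holders were left behind",   # term at the start of the text
    "he finally received his siv.",   # term followed by punctuation
    "a massive backlog",              # "siv" inside another word
]

padded_hits = [" siv " in s.lower() for s in samples]

# Matches "siv", "sivs" and "siv's" as whole words, ignoring case.
siv_pattern = re.compile(r"\bsiv(?:s|'s)?\b", flags=re.IGNORECASE)
boundary_hits = [bool(siv_pattern.search(s)) for s in samples]
```

A single pattern like this could also replace the three separate padded-term calls.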

Try excluding first.

# Excluding rows
## First round

In [23]:
tweets_nonAfg.Flag = "no"
#tweets_nonAfg.head()

In [24]:
excluded = ["arkham", "gotham", "batman", "rohinga", "rohingya",
    "syrian", "syria", "tamil", "tigray", "sudan", "sudanese", "somalia", "somali",
    "congo", "congalese", "congolese", "eritrea", "eritrean", "apo", "nigeria", "nigerian",
    "uganda", "ugandan", "rwamwanja", "kirwa", "biloela", "iraq", "iraqi",
    "yemen", "yemeni", "rwanda", "rwandan", "kenya", "kenyan", "kashmir", "kashmiri",
    "palestine", "palestinian", "haiti", "haitian", "tesla", "TSLA", "ukraine", "ukrainian",
    "your patience", "your continued patience", "rental", "tenant", "tenants", "buddhist", "hindu"]

for term in excluded:
    aa.flag_term(term, tweets_nonAfg, clean_col="ContentClean", flag_col="Flag", indx_warning=False, verby=False)

tweets_nonAfg.tail()

no     156508
yes         5
Name: Flag, dtype: int64
no     156504
yes         9
Name: Flag, dtype: int64
no     156503
yes        10
Name: Flag, dtype: int64
no     156500
yes        13
Name: Flag, dtype: int64
no     155985
yes       528
Name: Flag, dtype: int64
no     154208
yes      2305
Name: Flag, dtype: int64
no     153361
yes      3152
Name: Flag, dtype: int64
no     153021
yes      3492
Name: Flag, dtype: int64
no     151906
yes      4607
Name: Flag, dtype: int64
no     151683
yes      4830
Name: Flag, dtype: int64
no     151683
yes      4830
Name: Flag, dtype: int64
no     151511
yes      5002
Name: Flag, dtype: int64
no     151273
yes      5240
Name: Flag, dtype: int64
no     151151
yes      5362
Name: Flag, dtype: int64
no     151151
yes      5362
Name: Flag, dtype: int64
no     151151
yes      5362
Name: Flag, dtype: int64
no     150505
yes      6008
Name: Flag, dtype: int64
no     150505
yes      6008
Name: Flag, dtype: int64
no     147518
yes      8995
Name: Flag, dtype:

Unnamed: 0,Date,Content,ContentClean,Flag,n_CapLetters,CapsRatio,AllCapWords,https,Mentions,Location,ReplyCount,RetweetCount,LikeCount,QuoteCount,Hashtags
156508,2021-01-01 02:55:56+00:00,Gee's Bend was a part of Federal Government's ...,Gee's Bend was a part of Federal Government's ...,no,9,0.07563,,"https://t.co/caOdAb9xOq, https://t.co/UJfSC4S59p",,Alabama,0,1,1,0,No hashtags
156509,2021-01-01 01:55:12+00:00,@StayFierce1973 @Lala43463561 @Raufmustafaye10...,"it is not we who prove this, but the R...",no,3,0.023438,,,"@StayFierce1973, @Lala43463561, @Raufmustafaye...",Toronto - Baku,0,0,1,0,No hashtags
156510,2021-01-01 01:07:04+00:00,20201230: Bryony Lau: Canada now resettles mor...,20201230: Bryony Lau: Canada now resettles mor...,no,8,0.029412,"UN""",https://t.co/UzF1CVFfgV,,Toronto,0,0,0,0,No hashtags
156511,2021-01-01 00:43:07+00:00,"@joemcafield Yep, just spent 40 mins (and coun...","Yep, just spent 40 mins (and counting) tryin...",no,6,0.056075,🙄🤦‍♀️😁\nHNY,,@joemcafield,"Wakefield, England",1,0,0,0,No hashtags
156512,2021-01-01 00:10:26+00:00,@StayFierce1973 @Raufmustafaye10 @alliemark5 @...,excuse me? how do you explain the doc...,no,4,0.022989,,https://t.co/9NvC8mkf6r,"@StayFierce1973, @Raufmustafaye10, @alliemark5...",Toronto - Baku,1,0,0,0,No hashtags
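
The loop above reports only the running total of flagged rows. For auditing an exclusion list it can help to record which term caught each row. A minimal sketch follows, assuming a small illustrative subset of the list and toy rows; the `ExcludedBy` column is hypothetical and not part of the notebook's data.

```python
import re

import pandas as pd

excluded_sample = ["rohingya", "syria", "your patience", "tsla"]  # illustrative subset

# One capture group holding all terms; re.escape keeps punctuation literal.
pattern = "(" + "|".join(re.escape(t) for t in excluded_sample) + ")"

toy = pd.DataFrame({"ContentClean": [
    "Rohingya camps flooded again",
    "Thank you for your patience",
    "Refugees welcome here",
    "Syrian and Rohingya families",
]})

# Leftmost matching term in each row (NaN when nothing matches), lower-cased for counting.
matched = toy["ContentClean"].str.extract(pattern, flags=re.IGNORECASE, expand=False).str.lower()
toy["ExcludedBy"] = matched
per_term = matched.value_counts()
remaining = toy[matched.isna()]
```

A row that matches several terms is attributed only to the one that appears first in its text, so the per-term counts are a lower bound for every term except the leftmost match.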


In [26]:
# Keep just the rows that have not been flagged
tweets_nonAfg_excl = tweets_nonAfg[tweets_nonAfg["Flag"] == 'no'].copy()
tweets_nonAfg_excl.shape

(105485, 15)

In [27]:
peek = aa.subset_gen(tweets_nonAfg_excl, 25)
aa.labeler(peek)

a dataframe and temp_subset_gen.csv of length 25 have been created
To end the session, enter 'ESC'


unk    17
n       5
y       3
Name: ContentLabel, dtype: int64

Overwhelmingly unknowable, but probably relevant. Other than that, split between relevant and not relevant for this project. This is discouraging.

### Check a few more terms
Given the reduced list, now check a few more terms:

"Australia"
"vietnam", "vietnamese",
"cambodia", "cambodian",
"indonesia", "indonesian",
"fire", "flood", "landslide", "cyclone"


In [77]:
# RESET ONLY
tweets_nonAfg_excl["Flag"] = "no"
#tweets_nonAfg.tail()

In [44]:
tweets_nonAfg_excl["Flag"] = "no"
aa.flag_term("landslide", tweets_nonAfg_excl, clean_col="ContentClean", flag_col="Flag")

new_flags = tweets_nonAfg_excl[tweets_nonAfg_excl["Flag"] == 'yes'].copy()
print(new_flags.shape)

no     105371
yes       114
Name: Flag, dtype: int64
(114, 15)


In [45]:
temp_subset = aa.subset_gen(new_flags, 25, seed=1080)

# NOTE: This function starts with an input: to reset the index
aa.labeler(temp_subset, col="ContentClean", lab="ContentLabel")

a dataframe and temp_subset_gen.csv of length 25 have been created
To end the session, enter 'ESC'


n     24
nn     1
Name: ContentLabel, dtype: int64

Exclude: "Australia", "vietnam", "cambodia", "indonesia", "fire", "flood", "landslide", "cyclone"

"Australia": most (18 / 25) are unknowable, but I am fairly confident that these are about detained people from Nauru and PNG. In fact, add those to the exclude list. The rest were excludable. 

"vietnam", "vietnamese": most of these rows talk just about the resettlement of Vietnamese after the war; I'm pretty certain that the context of the postings has to do with the situation in Afghanistan, but that is not provable given my methodology. So, unfortunately, these rows will be excluded.

"cambodia", "cambodian": mostly about either war refugees from the 60s/70s or the Australia/Cambodia refugee resettlement deals today. Can be omitted.

"indonesia", "indonesian": These are mainly pleas by refugees currently detained in Indonesia (or by refugee groups) asking to be resettled elsewhere.
<br> -> found: Hazara: a Persian-speaking ethnic group native to, and primarily residing in the Hazarajat region in central Afghanistan and generally scattered throughout Afghanistan.

"fire", "flood", "landslide", "cyclone": although very occasionally these are used to refer to political events, they are overwhelmingly referring to natural disasters. Can be removed.



In [46]:
flag_list = ["australia", "vietnam", "cambodia", "indonesia", "fire", "flood", "landslide", "cyclone"]

tweets_nonAfg_excl["Flag"] = "no"

for term in flag_list:
    # NOTE: resets the index
    aa.flag_term(term, tweets_nonAfg_excl, clean_col="ContentClean", flag_col="Flag", indx_warning=False)

print(tweets_nonAfg_excl.shape)

no     102042
yes      3443
Name: Flag, dtype: int64
no     101515
yes      3970
Name: Flag, dtype: int64
no     101426
yes      4059
Name: Flag, dtype: int64
no     96004
yes     9481
Name: Flag, dtype: int64
no     95504
yes     9981
Name: Flag, dtype: int64
no     95029
yes    10456
Name: Flag, dtype: int64
no     94926
yes    10559
Name: Flag, dtype: int64
no     94875
yes    10610
Name: Flag, dtype: int64
(105485, 15)


In [47]:
tweets_nonAfg_excl = tweets_nonAfg_excl[tweets_nonAfg_excl["Flag"] == 'no'].copy()
print(tweets_nonAfg_excl.shape)

(94875, 15)
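
The value counts printed by the loop above are cumulative, so the contribution of each term is the difference between consecutive "yes" totals. The arithmetic can be sketched with the numbers copied from that output; the variable names are only for illustration.

```python
flag_list = ["australia", "vietnam", "cambodia", "indonesia", "fire", "flood", "landslide", "cyclone"]

# Cumulative "yes" counts copied from the output above, one per term in order.
cumulative_yes = [3443, 3970, 4059, 9481, 9981, 10456, 10559, 10610]

# Rows newly flagged by each term = difference between consecutive totals.
previous = [0] + cumulative_yes[:-1]
newly_flagged = {term: now - before
                 for term, now, before in zip(flag_list, cumulative_yes, previous)}

total_rows = 105485
rows_left = total_rows - cumulative_yes[-1]  # 94875, matching the shape printed above
```

These are counts of rows not already caught by an earlier term, so they depend on the order of the list. For example, "landslide" adds 103 rows here but flagged 114 when run on its own earlier, which means 11 of its rows had already been flagged by previous terms. "indonesia" is by far the largest contributor, with 5,422 newly flagged rows.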


Now let's look at a subset and see how representative it is.

In [48]:
temp_subset = aa.subset_gen(tweets_nonAfg_excl, 25, seed=1080)

# NOTE: This function starts with an input: to reset the index
aa.labeler(temp_subset, col="ContentClean", lab="ContentLabel")

a dataframe and temp_subset_gen.csv of length 25 have been created
To end the session, enter 'ESC'


n      12
unk     9
y       4
Name: ContentLabel, dtype: int64

Half are not relevant; most of the rest are unknowable. So this is pretty... dunno.

Let's try muslim, arab, islam.

In [52]:
tweets_nonAfg_excl["Flag"] = "no"
aa.flag_term("muslim", tweets_nonAfg_excl, clean_col="ContentClean", flag_col="Flag")
aa.flag_term("arab", tweets_nonAfg_excl, clean_col="ContentClean", flag_col="Flag", indx_warning=False)
aa.flag_term("islam", tweets_nonAfg_excl, clean_col="ContentClean", flag_col="Flag", indx_warning=False)

new_flags = tweets_nonAfg_excl[tweets_nonAfg_excl["Flag"] == 'yes'].copy()
print(new_flags.shape)

new_flags["Flag"] = "no"
aa.flag_term(" siv ", new_flags, clean_col="ContentClean", flag_col="Flag")
aa.flag_term(" sivs ", new_flags, clean_col="ContentClean", flag_col="Flag", indx_warning=False)
aa.flag_term(" siv's ", new_flags, clean_col="ContentClean", flag_col="Flag", indx_warning=False)

new_flags = new_flags[new_flags["Flag"] == 'no'].copy()
print(new_flags.shape)


no     94019
yes      856
Name: Flag, dtype: int64
no     93365
yes     1510
Name: Flag, dtype: int64
no     93154
yes     1721
Name: Flag, dtype: int64
(1721, 15)
no     1716
yes       5
Name: Flag, dtype: int64
no     1715
yes       6
Name: Flag, dtype: int64
no     1715
yes       6
Name: Flag, dtype: int64
(1715, 15)


In [53]:
temp_subset = aa.subset_gen(new_flags, 25, seed=1080)

# NOTE: This function starts with an input: to reset the index
aa.labeler(temp_subset, col="ContentClean", lab="ContentLabel")

a dataframe and temp_subset_gen.csv of length 25 have been created
To end the session, enter 'ESC'


n      16
y       6
unk     3
Name: ContentLabel, dtype: int64

No good. Mostly about Armenia, Azerbaijan, China. 

Let's take one last look at interpreter and translator.

In [54]:
tweets_nonAfg_excl["Flag"] = "no"
aa.flag_term("interpreter", tweets_nonAfg_excl, clean_col="ContentClean", flag_col="Flag")
aa.flag_term("translator", tweets_nonAfg_excl, clean_col="ContentClean", flag_col="Flag", indx_warning=False)

new_flags = tweets_nonAfg_excl[tweets_nonAfg_excl["Flag"] == 'yes'].copy()
print(new_flags.shape)

new_flags["Flag"] = "no"
aa.flag_term(" siv ", new_flags, clean_col="ContentClean", flag_col="Flag")
aa.flag_term(" sivs ", new_flags, clean_col="ContentClean", flag_col="Flag", indx_warning=False)
aa.flag_term(" siv's ", new_flags, clean_col="ContentClean", flag_col="Flag", indx_warning=False)

new_flags = new_flags[new_flags["Flag"] == 'no'].copy()
print(new_flags.shape)


no     94533
yes      342
Name: Flag, dtype: int64
no     94032
yes      843
Name: Flag, dtype: int64
(843, 15)
no     778
yes     65
Name: Flag, dtype: int64
no     770
yes     73
Name: Flag, dtype: int64
no     770
yes     73
Name: Flag, dtype: int64
(770, 15)


In [55]:
temp_subset = aa.subset_gen(new_flags, 25, seed=1080)

# NOTE: This function starts with an input: to reset the index
aa.labeler(temp_subset, col="ContentClean", lab="ContentLabel")

a dataframe and temp_subset_gen.csv of length 25 have been created
To end the session, enter 'ESC'


y      21
unk     3
n       1
Name: ContentLabel, dtype: int64

In [56]:
tweets_nonAfg_excl["Flag"] = "no"
aa.flag_term("allies", tweets_nonAfg_excl, clean_col="ContentClean", flag_col="Flag")

new_flags = tweets_nonAfg_excl[tweets_nonAfg_excl["Flag"] == 'yes'].copy()
print(new_flags.shape)

new_flags["Flag"] = "no"
aa.flag_term(" siv ", new_flags, clean_col="ContentClean", flag_col="Flag")
aa.flag_term(" sivs ", new_flags, clean_col="ContentClean", flag_col="Flag", indx_warning=False)
aa.flag_term(" siv's ", new_flags, clean_col="ContentClean", flag_col="Flag", indx_warning=False)

new_flags = new_flags[new_flags["Flag"] == 'no'].copy()
print(new_flags.shape)


no     94293
yes      582
Name: Flag, dtype: int64
(582, 15)
no     436
yes    146
Name: Flag, dtype: int64
no     433
yes    149
Name: Flag, dtype: int64
no     432
yes    150
Name: Flag, dtype: int64
(432, 15)


In [57]:
temp_subset = aa.subset_gen(new_flags, 25, seed=1080)

# NOTE: This function starts with an input: to reset the index
aa.labeler(temp_subset, col="ContentClean", lab="ContentLabel")

a dataframe and temp_subset_gen.csv of length 25 have been created
To end the session, enter 'ESC'


y     16
n      7
yb     2
Name: ContentLabel, dtype: int64
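
As a rough check on how much weight 25-row samples such as the two above can carry, a confidence interval for the share of relevant rows can be computed. The Wilson score interval used below is a standard textbook method that is not part of the original analysis; it is added here only as an illustrative sketch, and it assumes the samples are random draws.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts of "y" labels from the two 25-row samples above.
interp_low, interp_high = wilson_interval(21, 25)  # "interpreter" / "translator"
allies_low, allies_high = wilson_interval(16, 25)  # "allies" (counting only plain "y")
```

With samples this small the intervals are wide and overlap, so the labels are better read as a rough indication than as a precise relevance rate.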

And, finally, the relevance of the resulting dataset.

"Translator" and "interpreter" look usable: 21 of the 25 sampled rows were relevant.

"Allies" is marginal: 16 of 25 were clearly relevant and 7 were not.

# Summary
This could go on forever. I could look for more exclusion terms ("israel") or more secondary inclusion terms ("humanitarian parole"). I could look for tweets that have 2+ of a wider set of inclusion terms. But all of this is just fighting for a rough handful of rows. I have a sufficient dataset that I am confident consists of mostly relevant rows. There does not appear to be any bias in the rows that are relevant but cannot be systematically included. 

Good enough.