Copyright (c) <2022>, <Regina Nockerts>
All rights reserved.

This source code is licensed under the BSD-style license found in the
LICENSE file in the root directory of this source tree. 


# Twitter scraping with snscrape
Thanks to:  <br>
> Beck, M. (2022, January 5). How to Scrape Tweets With snscrape. Medium. https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af

First, install snscrape and import needed libraries.

In [1]:
# pip install snscrape
import snscrape.modules.twitter as sntwitter
import pandas as pd
import numpy as np
from numpy import random as rand
from nlpUtils import aardvark as aa 
import os.path

# Explore Scraping

## Important note on AND and OR in twitter searches:

Precedence of AND before OR:<br>
<br>
One key difference to keep in mind when using operators with the new recent search and filtered stream endpoints is that<br>
* In the old v1.1 API (search/tweets), OR is applied before logical AND (which is denoted by a space between terms or operators)
* In the new Twitter API (recent search and filtered stream), AND is applied before OR
<br>

See example below:<br>
<br>
Query: corona covid OR covid-19<br>
Interpretation in old standard search endpoint: <br>
* Will return all Tweets with the term corona along with either the term covid or covid-19

Interpretation in new recent search endpoint:<br>
* Will return all Tweets that either contain:
    * both the terms - corona and covid
    * or the term covid-19
<br>

Quoted from: https://developer.twitter.com/en/docs/tutorials/building-high-quality-filters
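
To make the two readings of `corona covid OR covid-19` concrete, here is a minimal sketch that writes each interpretation as a plain Python predicate. The function names are hypothetical, and splitting on whitespace is only a stand-in for however Twitter actually tokenizes tweets.

```python
def old_v1_match(text: str) -> bool:
    """Old v1.1 reading: OR binds first, so corona AND (covid OR covid-19)."""
    words = set(text.lower().split())  # assumption: naive whitespace tokenizer
    return "corona" in words and ("covid" in words or "covid-19" in words)


def new_v2_match(text: str) -> bool:
    """New reading: AND binds first, so (corona AND covid) OR covid-19."""
    words = set(text.lower().split())
    return ("corona" in words and "covid" in words) or "covid-19" in words


examples = [
    "corona and covid are trending",
    "covid-19 cases are rising",
    "corona beer is on sale",
]
for text in examples:
    print(f"{text!r}: old={old_v1_match(text)}, new={new_v2_match(text)}")
```

The second example is the interesting one: it only matches under the new precedence, because it contains covid-19 but not corona. Explicit parentheses in the query remove the ambiguity either way.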

## First look at what types of data can be collected

Scrape the data to a JSON file via the terminal. <br>
This is not the preferred method as it collects personally identifiable information. 
However, I will use it first to see what the data looks like.

In [None]:
# SAMPLE user queries for the terminal:
snscrape --jsonl --progress --max-results 10 --since 2021-06-01 twitter-search "its the elephant until:2022-07-31" > text-query-tweets.json

And we get the following info types: <br>
_type; 
url; 
date; 
content (emojis are included in the body of the content as unicode characters - ex: \ud83d\udc47); 
renderedContent (not sure how this is different than the "content"); 
id; 
user (username, id, displayname, description, rawDescription, descriptionUrls, verified, created, followersCount, friendsCount, statusesCount, favouritesCount, listedCount, mediaCount, location, protected, linkUrl, linkTcourl, profileImageUrl, profileBannerUrl, label, url); 
replyCount; 
retweetCount; 
likeCount; 
quoteCount; 
conversationId; 
lang; 
source; 
sourceUrl; 
sourceLabel; 
outlinks; 
tcooutlinks; 
media; 
retweetedTweet; 
quotedTweet; 
inReplyToTweetId; 
inReplyToUser; 
mentionedUsers; 
coordinates; 
place; 
hashtags; 
cashtags.
<br><br>
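
A quick way to inspect these fields is to load the JSON-lines file into pandas. The sketch below is not the author's code: it writes a tiny made-up sample file (two invented records with only a few of the fields above) so that it runs without scraping, and `load_jsonl` is a hypothetical helper name. Pointing it at `text-query-tweets.json` should work the same way, assuming one JSON object per line.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in records; real snscrape output has many more fields.
sample_records = [
    {"url": "https://example.com/1", "content": "first tweet", "lang": "en",
     "user": {"username": "someone", "followersCount": 10}},
    {"url": "https://example.com/2", "content": "second tweet", "lang": "en",
     "user": {"username": "other", "followersCount": 25}},
]
sample_path = Path(tempfile.mkdtemp()) / "sample-tweets.json"
sample_path.write_text(
    "\n".join(json.dumps(record) for record in sample_records), encoding="utf-8"
)


def load_jsonl(path):
    """Read one JSON object per line and flatten nested dicts into dotted columns."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return pd.json_normalize(records)


tweets_raw = load_jsonl(sample_path)
print(sorted(tweets_raw.columns))
```

Flattening turns the nested `user` object into columns such as `user.username`, which makes it easy to see exactly which personally identifiable fields would need to be dropped.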

In [12]:
tweets_list = []

for i, tweet in enumerate(sntwitter.TwitterSearchScraper("its the elephant until:2022-04-10").get_items()):
    if i > 500:
        break
    tweets_list.append([tweet.date, tweet.content, tweet.renderedContent])

tweets_lower = pd.DataFrame(tweets_list, columns=['Datetime', 'Text', 'Rendered Text'])

## How the search function works
Is the search case-insensitive? It looks like it: every capitalization below returns the same 320 rows. <br>
Does combining a term with itself via AND or OR change the results? No, the row count stays the same.

In [113]:
tweets_list = []

# for i, tweet in enumerate(sntwitter.TwitterSearchScraper("(bijltjespad) since:2006-04-01 until:2020-04-02").get_items()):
# for i, tweet in enumerate(sntwitter.TwitterSearchScraper("(Bijltjespad) since:2006-04-01 until:2020-04-02").get_items()):
# for i, tweet in enumerate(sntwitter.TwitterSearchScraper("(BIJLTJESPAD) since:2006-04-01 until:2020-04-02").get_items()):
# for i, tweet in enumerate(sntwitter.TwitterSearchScraper("(BIJLTJESPAD) AND (bijltjespad) since:2006-04-01 until:2020-04-02").get_items()):
# for i, tweet in enumerate(sntwitter.TwitterSearchScraper("(BIJLTJESPAD OR bijltjespad) AND (bijltjespad) since:2006-04-01 until:2020-04-02").get_items()):
for i, tweet in enumerate(sntwitter.TwitterSearchScraper("(BIJLTJESPAD OR bijltjespad) AND (bijltjespad OR Bijltjespad) since:2006-04-01 until:2020-04-02").get_items()):
    if i > 500:
        break
    tweets_list.append([tweet.date, tweet.content])

# tweets_lower = pd.DataFrame(tweets_list, columns=['Datetime', 'Content'])
print(tweets_lower.shape)

# tweets_cap = pd.DataFrame(tweets_list, columns=['Datetime', 'Content'])
print(tweets_cap.shape)

# tweets_allcap = pd.DataFrame(tweets_list, columns=['Datetime', 'Content'])
print(tweets_allcap.shape)

# tweets_allcap_lower = pd.DataFrame(tweets_list, columns=['Datetime', 'Content'])
print(tweets_allcap_lower.shape)

# tweets_allcap_lower_or = pd.DataFrame(tweets_list, columns=['Datetime', 'Content'])
print(tweets_allcap_lower_or.shape)

tweets_allcap_lower_or_or = pd.DataFrame(tweets_list, columns=['Datetime', 'Content'])
print(tweets_allcap_lower_or_or.shape)


(320, 2)
(320, 2)
(320, 2)
(320, 2)
(320, 2)
(320, 2)


Can I use a wildcard (incomplete*)?

Yes, it seems so. But this will return a LOT of junk.

In [24]:
tweets_list = []

for i, tweet in enumerate(sntwitter.TwitterSearchScraper("(afghan*) since:2021-09-10 until:2021-09-11").get_items()):
    if i > 500:
        break
    tweets_list.append([tweet.date, tweet.content])

tweets_temp = pd.DataFrame(tweets_list, columns=['Date', 'ContentClean'])
tweets_temp.tail()

Unnamed: 0,Date,ContentClean
496,2021-09-10 23:17:49+00:00,@Jessnj4554 @politvidchannel The Afghan price ...
497,2021-09-10 23:17:39+00:00,"When are people going to #WakeUp\n\nTake That,..."
498,2021-09-10 23:17:36+00:00,Measles Cases Force US To Suspend Afghan Refug...
499,2021-09-10 23:17:29+00:00,Joe Biden is guilty of war crimes for bombing ...
500,2021-09-10 23:17:27+00:00,Afghan special forces commando held after arme...


In [None]:
aa.term_check("Afghan", tweets_temp)

In [27]:
afg_index = []
counter = 0

tweets_temp.reset_index(drop=True, inplace=True)
for i, tweet in enumerate(tweets_temp["ContentClean"]):
    if "afghan " in tweet.lower():
        afg_index.append(i)
        counter +=1

tweets_temp.drop(afg_index, inplace=True)
tweets_temp.reset_index(drop=True, inplace=True)

print(len(afg_index))
print(counter)
print(tweets_temp.shape)
tweets_temp.tail()

0
0
(142, 2)


Unnamed: 0,Date,ContentClean
137,2021-09-10 23:18:40+00:00,Remember a 'double suicide bomb attack' on wai...
138,2021-09-10 23:18:31+00:00,US gives 1st public look inside base housing A...
139,2021-09-10 23:18:28+00:00,So maybe the reason names of the 2 top Afghans...
140,2021-09-10 23:18:20+00:00,.@JoeBiden killed a bunch of innocent Afghans ...
141,2021-09-10 23:17:55+00:00,@DrFeelgood95 Can’t find the words to describe...
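
As an alternative to the substring test on `"afghan "` used above, a word-boundary regular expression can separate tweets that contain the exact term from those that only matched through the wildcard. This is a sketch on made-up example rows, not the scraped data, and `split_on_term` is a hypothetical helper name.

```python
import pandas as pd


def split_on_term(df, col, pattern):
    """Return (rows matching the regex, rows not matching), ignoring case."""
    mask = df[col].str.contains(pattern, case=False, regex=True, na=False)
    return df[mask], df[~mask]


# Made-up rows standing in for tweets_temp.
demo = pd.DataFrame({"ContentClean": [
    "Afghan evacuees arrive in Virginia",
    "Crocheted a new afghan blanket today",
    "Afghans deserve better",
    "Afghanistan withdrawal news",
]})

# \b marks a word boundary, so "Afghans" and "Afghanistan" do not count as exact hits.
exact, wildcard_only = split_on_term(demo, "ContentClean", r"\bafghan\b")
print(len(exact), "exact matches;", len(wildcard_only), "found only via the wildcard")
```

Unlike the trailing-space test, the boundary version also catches the term at the end of a tweet or before punctuation. The blanket example shows that an exact match can still be junk, so a manual relevance check is needed either way.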


In [28]:
look = aa.subset_gen(tweets_temp, 20, seed=1080)
look.insert(loc=2, column='ContentLabel', value="")

# NOTE: This function starts with an input: to reset the index
aa.labeler(look, col="ContentClean", lab="ContentLabel")


## ISSUE: Missed Matches
**WTF is going on with the search dropping some rows when a term is added with OR?**

In [93]:
tweets_list = []

for i, tweet in enumerate(sntwitter.TwitterSearchScraper("(bijltjespad OR kattenburgergracht) AND (bijltjespad OR amsterdam) since:2006-04-01 until:2020-04-02").get_items()):
    if i > 500:
        break
    tweets_list.append([tweet.date, tweet.content])

#tweets_no_kat = pd.DataFrame(tweets_list, columns=['Datetime', 'Content'])
#print(tweets_no_kat.shape)
tweets_kat = pd.DataFrame(tweets_list, columns=['Datetime', 'Content'])
print(tweets_kat.shape)

(320, 2)
(501, 2)


In [94]:
# NOTE: This function starts with an input: box
# NOTE: This function returns THREE dataframes: superset only AND subset only AND inner/overlap, in that order
kat, no_kat, both = aa.outer_df (tweets_kat, tweets_no_kat)
print(kat.shape)
print(no_kat.shape)
print(both.shape)

0: P 1 BAD-04 (Kleine IBGS) Stank/hind. lucht (gaslucht) (meting) , Bijltjespad Amsterdam 13ASN https://t.co/YpWd61Q6h0
200: Studentenkamer Amsterdam, Bijltjespad: Amsterdam, Bijltjespad, Deze zomer (rond 30 juni t/m 30 augustus, maar... http://t.co/gBWMCW85lq
400: Huisgenoot gratis af te halen aan bijltjespad 30!: Huisgenoot gratis af te halen aan bijltjespad 30! http://tinyurl.com/d8hovh
(185, 4)
(4, 4)
(406, 4)


In [102]:
no_kat.loc[2, "Content"]

'Calamiteit Amsterdam 04:38 - BR 1 BRANDGERUCHT (Incidentnet: reg.inmeld) BIJLTJESPAD 6 ASD [ 531 ]'

In [None]:
#NOTE: This function opens an input: box
aa.labeler(no_kats_df, col="Content", lab="ContentLabel")

## About: missed matches
* "(bijltjespad)" and "(bijltjespad) OR (bijltjespad)" and "(bijltjespad) AND (bijltjespad)" all have the same number of rows: 320
* "(bijltjespad) AND (bijltjespad OR amsterdam)" also has the same number, as expected: 320
* "(bijltjespad OR kattenburgergracht) AND (bijltjespad OR amsterdam)" has extra rows, as expected: 501
    * 501 - 320 = 181 extra rows.
* BUT when you find the difference between the sets, it is not that simple...
    * the set that includes "kattenburgergracht" has 185 rows that are not in the smaller set;
        * These look like legitimate exclusions from the smaller set.
    * the set that does not include "kattenburgergracht" has 4 rows that are not in the larger set;
        * These look like they **should** have been included in the larger set:
            * they all should have matched "(bijltjespad) AND (bijltjespad)"
            * and should also have matched "(bijltjespad) AND (amsterdam)" [NOTE: tried this with a larger set and about half should have matched with amsterdam and half not]
    * 185 - 4 = 181 --> the extra rows value.
* **So why does adding a term lose us some relevant rows (even if it adds more)?**
    * It's not the capitalization
    * It's not the since: until:
* **More important, what can I do about this?**
    * Try running these searches through the terminal with this format:<br>
        * snscrape --jsonl --progress --max-results 500000 --since 2021-01-01 twitter-search "its the elephant until:2022-01-01" > text-query-tweets.json
    * Just note this as a shortcoming of the methodology and move on, at least for now. This is for "future work"
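
The set comparison above can be sketched with a plain pandas outer merge. `aa.outer_df` is a private helper whose internals are not shown here, so `outer_split` below is only a guess at equivalent behavior, demonstrated on made-up rows.

```python
import pandas as pd


def outer_split(superset, subset, on):
    """Split two frames into (superset only, subset only, in both) using key columns `on`."""
    # Drop duplicate keys first so the merge cannot multiply rows.
    left = superset.drop_duplicates(subset=on)
    right = subset.drop_duplicates(subset=on)
    merged = left.merge(right, on=on, how="outer", indicator=True)
    only_super = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    only_sub = merged[merged["_merge"] == "right_only"].drop(columns="_merge")
    both = merged[merged["_merge"] == "both"].drop(columns="_merge")
    return only_super, only_sub, both


# Made-up stand-ins for tweets_kat (larger search) and tweets_no_kat (smaller search).
big = pd.DataFrame({"Datetime": ["d1", "d2", "d3", "d4"], "Content": ["a", "b", "c", "d"]})
small = pd.DataFrame({"Datetime": ["d2", "d3", "d5"], "Content": ["b", "c", "e"]})

only_big, only_small, both = outer_split(big, small, on=["Datetime", "Content"])
print(len(only_big), len(only_small), len(both))

# Same bookkeeping as 185 - 4 = 181: the row-count gap equals the gap between the two "only" sets.
assert len(big) - len(small) == len(only_big) - len(only_small)
```

If the smaller search were a true subset, `only_small` would be empty. Any rows that show up there are the "missed matches" described above.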

## Content v. Rendered Content?
I'm not sure what the difference is between content and rendered content, so I am going to collect 500 tweets and compare those two fields.

In [None]:
tweets_df["Same"] = np.where(tweets_df["Text"] == tweets_df["Rendered Text"], 1, 0)
print(sum(tweets_df["Same"])) # that's a lot of different tweets, but not all. 

print(tweets_df["Text"][7])  # look at one example closer up.
print(tweets_df["Rendered Text"][7])

tweets_df.head(10)


The difference appears to be how embedded links are treated. As the first version is still a followable link, I will preserve that version of the tweet text.
<br><br>
We will do the actual scraping from the Python wrapper, specifying which fields to save and thus excluding much personally identifiable information.
<br>
So we will want to collect: <br>
* date;
* content (including emojis as unicode characters - ex: \ud83d\udc47); 
* user.followersCount; 
* user.friendsCount; 
* user.location; 
* replyCount; 
* retweetCount; 
* likeCount; 
* quoteCount; 
* lang; 
* retweetedTweet; 
* quotedTweet; 
* coordinates; 
* place; 
* hashtags; 
* cashtags.
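
One way to keep the personally identifiable fields out of the data is to pull only a whitelist of attributes from each tweet object. The sketch below assumes the attribute names listed above. `tweet_to_row` is a hypothetical helper, and the `SimpleNamespace` objects are made-up stand-ins for a scraped tweet so that the example runs without network access.

```python
from types import SimpleNamespace

TWEET_FIELDS = ["date", "content", "replyCount", "retweetCount", "likeCount",
                "quoteCount", "lang", "retweetedTweet", "quotedTweet",
                "coordinates", "place", "hashtags", "cashtags"]
USER_FIELDS = ["followersCount", "friendsCount", "location"]


def tweet_to_row(tweet):
    """Copy only whitelisted fields; anything missing becomes None."""
    row = {name: getattr(tweet, name, None) for name in TWEET_FIELDS}
    user = getattr(tweet, "user", None)
    for name in USER_FIELDS:
        row[f"user.{name}"] = getattr(user, name, None)
    return row


# Made-up tweet-like object; username is present but is never copied.
fake_user = SimpleNamespace(username="someone", followersCount=12,
                            friendsCount=34, location="Denver")
fake_tweet = SimpleNamespace(date="2021-08-16", content="example text", lang="en",
                             likeCount=3, user=fake_user)

row = tweet_to_row(fake_tweet)
print(row)
```

Rows built this way can go straight into `pd.DataFrame(list_of_rows)`, and dropping a field later only means editing one of the two lists.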

## Baseline search
But for now, let's take some small tweet sets and refine search terms.

In [None]:
tweets_list = []

for i, tweet in enumerate(sntwitter.TwitterSearchScraper('(Afghanistan OR Afghan OR Afghani OR withdrawal OR war OR resettle OR resettlement) AND ("come here" OR migrant OR immigrant OR refugee OR asylum OR resettle OR resettlement) lang:en since:2021-08-16 until:2021-08-17').get_items()):
    if i > 500000:
        break
    tweets_list.append([tweet.date, tweet.content])

tweets_df = pd.DataFrame(tweets_list, columns=["Date", "Content"])
tweets_df.to_csv(os.path.join('archiveData','hannos_test_tweets.csv'))

In [130]:
tweets_df.to_csv(os.path.join('archiveData','2021-09-01_2022-02-02_FIRSTpass_tweets.csv'))

In [134]:
print(tweets_df.head())
tweets_df.tail()

                       Date                                            Content
0 2021-09-01 23:59:04+00:00  @BaphoNetArchive I'm from Texas originally, an...
1 2021-09-01 23:57:21+00:00  @SharonP92453996 @kelliwardaz @LyndsayMKeith @...
2 2021-09-01 23:56:18+00:00  NEW: Most Afghan evacuees will arrive to the U...
3 2021-09-01 23:52:10+00:00  @Cristiano I am the son of an Afghanistan immi...
4 2021-09-01 23:51:10+00:00  So the left was ok with allowing over 100k ref...


Unnamed: 0,Date,Content
2206,2021-09-01 00:01:51+00:00,Efforts to resettle people here continue. Ange...
2207,2021-09-01 00:01:43+00:00,One question that I’ll be mulling over in the ...
2208,2021-09-01 00:00:58+00:00,Native people from Afghanistan lawfully reside...
2209,2021-09-01 00:00:25+00:00,"People who wanted to sanction Pakistan, now wa..."
2210,2021-09-01 00:00:22+00:00,Franklin County commissioners adopted a resolu...


So, yeah, one day in the middle of August, before Kabul fell, there were about 2,000+ tweets. Is this a lot? It seems fine, but somewhat less than I expected. Enough to do interesting things with. But I am concerned that I am missing most of the Twitter conversation that I am interested in, which limits my policy relevance.

# Finding Search Terms
So, this is actually quite hard. 

I started with a set of terms from my own knowledge, background literature on refugees and Twitter sentiment, and major American news sources (Washington Post, Vox, Axios). I added terms found via right-leaning news sources (National Review, Fox News, Washington Free Beacon, and American Thinker). I then reviewed some left-leaning sites (Slate, NPR, Mother Jones, CommonDreams) to ensure that a fair cross section of language was represented when selecting terms.

__________
I can also try reviewing the poll questions.
____________

I want to add terms to my base set such that: 
* the terms add **relevant** tweets to the dataset
* the terms do not add irrelevant tweets to the dataset
    * which requires comparing the difference between the new (superset) and old (subset) searches and looking at randomly selected tweets
* the dataset ends up broadly representative of the true stance/sentiment expressed on Twitter
    * Which is hard, because I do not know what the true values are.

I can stop adding terms when adding a new term: 
* does not add more rows to the dataset ~ irrelevant, my existing terms do the same work, or
* adds a large percentage of irrelevant rows to the dataset ~ counterproductive.
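
These stopping rules can be turned into a rough calculation: scale the hand-checked sample proportions up to the full difference sets and compare the estimated relevant rows gained against those lost. The sketch below uses figures reported in the search log that follows ("SIV" and "women and girls"). `evaluate_term` and its `min_share` threshold are hypothetical, and the estimates are point estimates from small samples, so they should be read as rough guidance only.

```python
def evaluate_term(added_rows, added_relevant, added_checked,
                  dropped_rows, dropped_relevant, dropped_checked,
                  min_share=0.5):
    """Estimate relevant rows gained and lost when a term is added to the search."""
    share_added = added_relevant / added_checked
    share_dropped = dropped_relevant / dropped_checked
    est_gained = added_rows * share_added
    est_lost = dropped_rows * share_dropped
    net = est_gained - est_lost
    return {
        "share_added_relevant": share_added,
        "est_relevant_gained": est_gained,
        "est_relevant_lost": est_lost,
        "est_net_relevant": net,
        # Assumed rule: keep only if most added rows are relevant and the net gain is positive.
        "keep": share_added >= min_share and net > 0,
    }


# "SIV": adds 16,761 rows (49 / 50 relevant), drops 527 rows (20 / 25 relevant).
siv = evaluate_term(16761, 49, 50, 527, 20, 25)
# "women and girls": adds 26,554 rows (38 / 50 irrelevant, so 12 / 50 relevant), drops 525 rows (20 / 25 relevant).
women_and_girls = evaluate_term(26554, 12, 50, 525, 20, 25)

for name, result in [("SIV", siv), ("women and girls", women_and_girls)]:
    print(name, {key: round(value, 2) for key, value in result.items()})
```

This simple rule agrees with the log for these two terms but will not reproduce every decision in it, since some calls also weighed how relevant the dropped rows were.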


Now we can look at some potential sets of terms.

## Search Log
"refugee OR migrant OR immigrant OR asylum OR afghanistan OR afghan OR afghani since:2021-04-20 until:2021-05-01" <br>
Plus the same terms for 04-14 to 04-20 <br>
Looking at the tweets from April 2021, they seem to be all over the place. There are no clear trends. And relatively few of them are actually about the idea of increased movement of Afghani nationals. Perhaps it is too early in the timeline - people have not yet recognized the scale of the need nor the impossibly short timeline?
______
After moving to mid-August (since:2021-08-15 until:2021-08-20), there are far more tweets - I hit max iter (500,000, 273 minutes) after ~1.5 days.
<br><br>
Now I'm curious about a different search term for April: "withdrawal." This definitely got more results (19,157), but more totally irrelevant ones and more related to troops but not refugees. 
<br><br>
Maybe "evacuation OR allies OR resettle OR resettlement"? Well, "allies" gets too many irrelevant hits (gender and race related). And "evacuation" is almost all natural disaster-related.
____
"(Afghanistan OR Afghan OR Afghani OR withdrawal OR war) AND (migrant OR immigrant OR refugee OR resettle OR resettlement) lang:en since:2021-04-14 until:2021-04-17" <br>
This is getting pretty close. I'm going to try it for one day in mid-August. Nope, too specific, I think. Needs to broaden.<br>
<br>
11 April 2022<br>
'(Afghanistan OR Afghan OR Afghani OR withdrawal OR war OR resettle OR resettlement) AND ("come here" OR migrant OR immigrant OR refugee OR asylum OR resettle OR resettlement) lang:en since:2021-08-19 until:2021-08-20'<br>
<br>
This is starting to look pretty good. Average tweets per day are lower than expected, though. For context, in mid-August 2021, the dueling hashtags #IStandWithBiden and #BidenDisaster were trending with ~50,000 tweets per day. On the same day (Aug. 16 to 17), 
In total for the year, my search returns 267,344 (275,588 without lang:en).
___
Next search: 
(Afghanistan OR Afghan OR Afghans OR Afghani OR withdrawal OR war OR "Afghan evacuation" OR resettle OR resettlement OR resettled) AND 
("come here" OR migrant OR immigrant OR refugee OR asylum OR vetted OR vetting OR unvetted OR "without identification" OR "lack identification" OR resettle OR resettlement OR resettled) 
lang:en since:2021-01-01 until:2022-01-01
<br><br>
Search returns: tweet_data_14_04, 319,915 rows (increase of 52,571)<br><br>
This was an improvement. I reviewed a random subset of the new rows and they look mostly relevant and new; they tend towards the anti-refugee side, as expected, though there are quite a few pro.

___
(Afghanistan OR Afghan OR Afghans OR Afghani OR withdrawal OR evacuation OR resettle OR resettlement OR resettled OR "humanitarian parole") AND (migrant OR immigrant OR refugee OR asylum OR vetted OR vetting OR unvetted OR "without identification" OR "lack identification" OR "lacking identification" OR resettle OR resettlement OR resettled OR "humanitarian parole") lang:en since:2021-01-01 until:2022-01-01
<br><br>
search 'tweet_data_16_04.csv' returns 270,015 rows<br>
Consider putting "come here" back into the second search term.

___
(Afghanistan OR Afghan OR Afghans OR Afghani OR withdrawal OR evacuation OR resettle OR resettlement OR resettled OR "humanitarian parole") AND ("come here" OR migrant OR immigrant OR refugee OR asylum OR vetted OR vetting OR unvetted OR "without identification" OR "lack identification" OR "lacking identification" OR resettle OR resettlement OR resettled OR "humanitarian parole") lang:en since:2021-01-01 until:2022-01-01

Search 'tweet_data_17_04.csv' returns 273,721 rows
<br><br>
Compared to search tweet_data_16_04: 
* in 16_04 and not 17_04: 1122 rows: Looked at 100 tweets from the difference: 10 were irrelevant, 40 were con, 28 were pro.
    * I'm not sure why this isn't empty. The second search only added terms, it did not remove any, so it should only have added matches...
* in 17_04 and not 16_04: 4832 rows: mostly relevant; mostly con, some pro.


Search 'tweet_data_17b_04.csv' returns 274,425 rows
* adds "to vet"
* 205 more rows than Search 'tweet_data_17_04.csv' <br>
--> Keep it in.

___
Search "tweet_data_18_04.csv" returns 274,281 rows:
* add: "Afghan allies" (1st term)
* Compared to 17b_04:
    * adds 154 mainly relevant rows (18/25 checked)
    * But also kicks out 289 relevant rows (19/25 checked)
    * the 'superset' has 144 FEWER rows <br>
--> Do not keep


Search "tweet_data_18b_04.csv" returns 300,446 rows:
* remove: "Afghan allies"
* add: "women and girls" (2nd term)
* Compared to "tweet_data_17b_04.csv":
    * Adds 26,554 rows, mainly irrelevant (38 / 50 checked)
        * this is mainly general expressions of pity, but little about resettlement specifically
    * Drops 525 rows, mainly relevant (20 / 25 checked) <br>
--> Do not keep

_________

Search "tweet_data_19_04.csv" returns 290,659 rows:
* remove: "women and girls"
* add: "SIV" OR "Special Immigrant Visa" (2nd term)
* Compared to: "tweet_data_17b_04.csv":
    * Adds 16,761 rows, mainly relevant (49 / 50 checked)
    * Drops 527 rows, mainly relevant (20 / 25 checked)<br>
--> Keep

Search "tweet_data_19b_04.csv" returns 293,865 rows:
* add: relocation (2nd term)
* Compared to "tweet_data_19_04.csv":
    * adds 3,507 rows, mostly relevant (41 / 50 checked)
    * drops 301 rows, mostly relevant (19 / 25 checked) <br>
--> keep

_________
Search "tweet_data_25_04.csv" returns       rows:
* add: "afgan", "afgani"
* remove: "come here"
* Compare to "tweet_data_19b_04.csv":
    * total: 3842 fewer rows
    * adds 1990 rows, mixed relevance
    * drops 5818 rows, mostly relevant
--> do not keep


Search "tweet_data_25b_04.csv" returns       rows:
* add: "come here"
* remove: "afgani"
* Compare to "tweet_data_19b_04.csv":
    * total: + 511 rows
    * adds 1935 rows, 39 / 50 relevant (78%)
    * drops 1409 rows, 18 / 20 relevant (90%) <br>
--> Do not keep. Back to tweet_data_19b_04.


_________
Search "tweet_data_26_04.csv" returns 371,751 rows:
* EXTEND scrape until: 2022-04-02
    * to see if trend returns to baseline
* Compare to "tweet_data_19b_04.csv":
    * adds 79,265 rows, 28 / 50 relevant
    * drops 2007 rows, 19 / 25 relevant
--> this adds a lot of irrelevant rows, as expected. Still useful for determining the baseline.

___________

Search "tweet_data_27_04.csv" returns 303,846 rows (9981 more rows):
* NOT extended scrape, until: 2022-01-01
* add: translators
* remove: "come here"
* Compare to "tweet_data_19b_04.csv":
    * add: 16,869 rows, 49 / 50 relevant
    * remove: 6,872 rows, 17 / 24 relevant
--> keep



Search "tweet_data_27b_04.csv" returns 384,555 rows (90,690 more rows):
* EXTEND scrape, until: 2022-05-01
* Compare to "tweet_data_27_04.csv":


* Compare to "tweet_data_19b_04.csv":
    * add: 96,968 rows, 35 / 50 relevant (70%)
    * remove: 6,912 rows, 18 / 25 relevant (~72%)

--> USE THIS ONE
______

Yet to be considered: 
* "Afghan translator/s"
* influx
* "refugee crisis"
* "screening process"
* "foreign nationals"
* airlift / airlifted

POSSIBLE: afgha*
* this will return a lot of junk, as well as more hits. Where is the tradeoff?


Rejected:
* "afghan allies"
* "women and girls"
* "afgan", "afgani"
* "come here" NOTE: this is a good term, but I'm out of space in the search...

Other terms that were considered: 
* visa
* crisis
* humanitarian
* war

In [None]:
# INCLUDES UNUSED COLUMNS
# tweets_list = []

# for i, tweet in enumerate(sntwitter.TwitterSearchScraper('(Afghanistan OR Afghan OR Afghans OR Afghani OR withdrawal OR evacuation OR resettle OR resettlement OR resettled OR "humanitarian parole") AND (migrant OR immigrant OR refugee OR asylum OR vetted OR vetting OR unvetted OR "without identification" OR "lack identification" OR "lacking identification" OR resettle OR resettlement OR resettled OR "humanitarian parole") lang:en since:2021-01-01 until:2022-01-01').get_items()):
#     if i > 500000:
#         break
#     if tweet.content.startswith("rt @"):
#         continue
#     else:
#         tweets_list.append([tweet.date, tweet.content, tweet.user.followersCount, tweet.user.friendsCount, tweet.user.location, \
#             tweet.replyCount, tweet.retweetCount, tweet.likeCount, tweet.quoteCount, tweet.lang, tweet.retweetedTweet, \
#                 tweet.quotedTweet, tweet.coordinates, tweet.place, tweet.hashtags])

# tweets_df = pd.DataFrame(tweets_list, columns=["Date", "Content", "FollowersCount", "FriendsCount", "Location", "ReplyCount", "RetweetCount", "LikeCount", \
#     "QuoteCount", "Lang", "RetweetedTweet", "QuotedTweet", "Coordinates", "Place", "Hashtags"])

# tweets_df.to_csv(os.path.join('archiveData','temp_full.csv'))
# tweets_df.tail()

NOTE: I think there is a max number of terms and I have reached it...
* replaced "come here" with "Afgani"

In [None]:
tweets_list = []

for i, tweet in enumerate(sntwitter.TwitterSearchScraper('(Afghanistan OR Afghan OR Afghans OR Afghani OR withdrawal OR evacuation OR resettle OR resettlement OR resettled OR "humanitarian parole") AND (translators OR migrant OR immigrant OR refugee OR asylum OR "SIV" OR "Special Immigrant Visa" OR vetted OR vetting OR unvetted OR "to vet" OR "without identification" OR "lack identification" OR "lacking identification" OR relocation OR resettle OR resettlement OR resettled OR "humanitarian parole") lang:en since:2021-01-01 until:2022-05-01').get_items()):
    if i > 500000:
        break
    if i % 10000 == 0:
        print("row:", i)

    # Compare case-insensitively: retweet text conventionally starts with "RT @".
    if tweet.content.lower().startswith("rt @"):
        continue
    else:
        tweets_list.append([tweet.date, tweet.content, tweet.user.location, \
            tweet.replyCount, tweet.retweetCount, tweet.likeCount, tweet.quoteCount, tweet.hashtags])

tweets_df = pd.DataFrame(tweets_list, columns=["Date", "Content", "Location", "ReplyCount", "RetweetCount", "LikeCount", \
    "QuoteCount", "Hashtags"])
# temp removed columns: , "FollowersCount", "FriendsCount"  //  tweet.user.followersCount, tweet.user.friendsCount, 
# Removed columns: "Lang", "RetweetedTweet", "QuotedTweet", "Coordinates", "Place"

tweets_df.to_csv(os.path.join('archiveData','temp_full.csv'))
tweets_df.tail()
print(tweets_df.shape)

In [55]:
# --------- EXPORT single search -----------
#tweets_df = tweets_df.drop_duplicates().reset_index(drop=True)
tweets_df.to_csv(os.path.join('archiveData','tweet_data_27b_04.csv'))

# --------- phased search -----------
# tweets_df.to_csv(os.path.join('archiveData','2021-09-01_2022-01-01_FIRSTpass_tweets.csv'))
# tweets_df.to_csv(os.path.join('archiveData','2021-05-01_2021-09-01_FIRSTpass_tweets.csv'))
# tweets_df.to_csv(os.path.join('archiveData','2021-01-01_2021-05-01_FIRSTpass_tweets.csv'))

In [None]:
# --------- IMPORT single search -----------
tweets_df = pd.read_csv(os.path.join('archiveData',"tweet_data_27b_04.csv"), header=0, index_col=0)
print(tweets_df.info(show_counts=True))

# --------- phased search -----------
# create a large  dataframe with the full year's worth of tweets
# tweets_data1 = pd.read_csv(os.path.join('archiveData',"2021-01-01_2021-05-01_FIRSTpass_tweets.csv"), header=0, index_col=0)
# tweets_data2 = pd.read_csv(os.path.join('archiveData',"2021-05-01_2021-09-01_FIRSTpass_tweets.csv"), header=0, index_col=0)
# tweets_data3 = pd.read_csv(os.path.join('archiveData',"2021-09-01_2022-02-02_FIRSTpass_tweets.csv"), header=0, index_col=0)

# frames = [tweets_data3, tweets_data2, tweets_data1]
# tweets_data = pd.concat(frames, ignore_index=True)
# tweets_data = tweets_data.drop_duplicates().reset_index(drop=True)
# tweets_data.to_csv(os.path.join('archiveData',"tweet_data.csv"))
# print(tweets_df.info(show_counts=True))
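
The phased search above splits the year into consecutive date windows and scrapes each one separately. A minimal sketch for generating those windows and the matching query strings follows. `phased_queries` is a hypothetical helper, and the short base query stands in for the full search term.

```python
import pandas as pd


def phased_queries(base_query, start, end, months=4):
    """Build one query per consecutive window of `months` months between start and end."""
    cursor = pd.Timestamp(start)
    stop = pd.Timestamp(end)
    queries = []
    while cursor < stop:
        # Clamp the final window so it never runs past the overall end date.
        window_end = min(cursor + pd.DateOffset(months=months), stop)
        queries.append(
            f"{base_query} since:{cursor.date().isoformat()} until:{window_end.date().isoformat()}"
        )
        cursor = window_end
    return queries


queries = phased_queries("(Afghan OR Afghanistan) AND (refugee OR asylum) lang:en",
                         "2021-01-01", "2022-01-01")
for query in queries:
    print(query)
```

Each query string can be passed to `sntwitter.TwitterSearchScraper` in turn and the resulting frames concatenated. Calling `drop_duplicates()` after `pd.concat`, as in the cell above, guards against any overlap at the window edges.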

### Examining the difference a term makes

In [None]:
#sub = pd.read_csv(os.path.join('archiveData',"tweet_data_clean.csv"), header=0, index_col=0)

# NOTE: This function starts with an input: box
diff_14_04 = aa.outer_df (tweets_df, sub, silent="no")


In [35]:
print(tweets_df.shape)
my_a = aa.subset_gen(tweets_df, 100)


(319916, 15)
a dataframe and .csv of length 100 have been created


In [65]:
big = pd.read_csv(os.path.join('data',"tweet_data_27_04.csv"), header=0, index_col=0)
small =  pd.read_csv(os.path.join('archiveData',"tweet_data_19b_04.csv"), header=0, index_col=0)

# NOTE: This function starts with an input: box
# NOTE: This function returns THREE dataframes: superset only AND subset only AND inner/overlap, in that order
big_d, small_d, both = aa.outer_df (big, small, silent="yes")

print("superset:", big.shape)
print("subset", small.shape)
print("the superset has", big.shape[0] - small.shape[0], "more rows.")
print()
print("just in superset:", big_d.shape)
print("just in subset:", small_d.shape)
#print("in both super- and subset:", both.shape)

superset: (303846, 8)
subset (293865, 10)
the superset has 9981 more rows.

just in superset: (16869, 18)
just in subset: (6872, 18)


In [66]:
my_big_d = aa.subset_gen(big_d, 50)
my_small_d = aa.subset_gen(small_d, 25)

a dataframe and temp_subset_gen.csv of length 50 have been created
a dataframe and temp_subset_gen.csv of length 25 have been created


In [None]:
#NOTE: This function opens an input: box
aa.labeler(my_big_d, col="Content", lab="ContentLabel")
aa.labeler(my_small_d, col="Content", lab="ContentLabel")

In [68]:
print(my_big_d["ContentLabel"].value_counts())
print(my_small_d["ContentLabel"].value_counts())

y    49
n     1
Name: ContentLabel, dtype: int64
y    17
n     7
Name: ContentLabel, dtype: int64


# Save final data set
'tweet_data_27b_04.csv' is the final data set. Save it to data folder.

In [None]:
tweets_df.to_csv(os.path.join('data','tweet_data_27b_04.csv'))

sntwitter.TwitterSearchScraper( <br>
    '(Afghanistan OR Afghan OR Afghans OR Afghani OR withdrawal OR evacuation OR resettle OR resettlement OR resettled OR "humanitarian parole") <br>
    AND <br>
    (translators OR migrant OR immigrant OR refugee OR asylum OR "SIV" OR "Special Immigrant Visa" OR vetted OR vetting OR unvetted OR "to vet" OR "without identification" OR "lack identification" OR "lacking identification" OR relocation OR resettle OR resettlement OR resettled OR "humanitarian parole") <br>
    lang:en since:2021-01-01 until:2022-05-01').get_items()<br>



So far:
* Search term: (Afghanistan OR Afghan OR Afghani OR withdrawal OR war OR resettle OR resettlement) AND ("come here" OR migrant OR immigrant OR refugee OR asylum OR resettle OR resettlement)
* Do include lang:en in the search
* Remove: -Gotham (4 rows) -Arkham (292 rows)
* check for and skip "rt @..."
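
The last two notes can also be applied after scraping, as a post-hoc filter on the saved data rather than as `-Gotham -Arkham` exclusions inside the query. This is a sketch on made-up rows. `clean_tweets` is a hypothetical helper, and matching the excluded terms case-insensitively is an assumption.

```python
import re

import pandas as pd


def clean_tweets(df, col="Content", exclude_terms=("Gotham", "Arkham")):
    """Drop rows that look like retweets or that mention any excluded term."""
    text = df[col].fillna("")
    is_retweet = text.str.lower().str.startswith("rt @")
    pattern = "|".join(re.escape(term) for term in exclude_terms)
    has_excluded = text.str.contains(pattern, case=False, regex=True)
    return df[~(is_retweet | has_excluded)].reset_index(drop=True)


# Made-up rows standing in for tweets_df.
demo = pd.DataFrame({"Content": [
    "RT @someone Afghan refugees arrive this week",
    "Arkham Asylum is still a great game",
    "Asylum seekers from Afghanistan resettled locally",
    "rt @other gotham needs refugee policy too",
]})

cleaned = clean_tweets(demo)
print(cleaned)
```

Filtering afterwards does not use up any of the limited space in the search string, at the cost of scraping a few hundred rows that are later thrown away.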