# Exercise W7D1: Review and Putting it All Together

This exercise aims to draw together the topics we have covered in the _Base Camp_ portion of the Digital Methods class. At the end of the exercise, you should have a `DataFrame` with each row containing information on a Twitter account including their tweets, friends, followers, hashtags and mentions as well as some descriptive statistics.

You will be able to reuse and modify this code for the second half of digital methods to download and analyze tweets for your projects. So, this exercise should provide you with a solid review of different things we have learned and help you for the the rest of the course.

**Exercise 1. Identify a topic, authenticate, and get data.** First, identify a topic of interest to you and think about a keyword or hashtag capturing the topic. Possible topics could be Corona or Climate, but you are welcome to choose something else. Then load the `tweepy` module and use the built-in functionality to `search` Twitter for your keyword or hashtag. Create a variable that contains the data returned by your search.

See [here](http://docs.tweepy.org/en/latest/api.html#help-methods) for more information about the `search` method.

In [1]:
import sys
! conda install --yes --prefix {sys.prefix} -c conda-forge tweepy

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [2]:
import tweepy
print(tweepy.__version__)

3.8.0


In [3]:
# set up authentication for twitter
# first, use tweep OAuthHandler function to generate authentification object
from AppCred import CONSUMER_KEY, CONSUMER_SECRET
from AppCred import ACCESS_TOKEN, ACCESS_TOKEN_SECRET

In [4]:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)

In [5]:
# second, add access token and secret to authentification object
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

In [6]:
# finally, generate variable to interact with twitter API
api = tweepy.API(auth)

In [7]:
election2020_data = api.search(q="#Election2020", count=100)
print(election2020_data)

[Status(_api=<tweepy.api.API object at 0x10b4a3310>, _json={'created_at': 'Fri Mar 20 08:12:16 +0000 2020', 'id': 1240914037971419136, 'id_str': '1240914037971419136', 'text': 'RT @IrvParrett: #ElizabethWarren is a hypocrite and might want to discuss her motives for pretending to be a Native American to advance her…', 'truncated': False, 'entities': {'hashtags': [{'text': 'ElizabethWarren', 'indices': [16, 32]}], 'symbols': [], 'user_mentions': [{'screen_name': 'IrvParrett', 'name': 'Irvin Parrett', 'id': 2491308421, 'id_str': '2491308421', 'indices': [3, 14]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 2491308421, 'id_str': '2491308421', 'name': 'Irvin Parrett', 'screen_name': 'IrvParrett', '

Now we have an object containing a number of tweets pertaining to our topic of interest. As you might remember, by default the Twitter API returns the data to us in JSON format. Now that we know about the elegance and beauty of `DataFrames`, we would prefer to work with that format of data rather than a dictionary-style JSON. 

**Exercise 2. Turn raw API data into a DataFrame.** Your search returned a set of tweets about your chosen topic. Construct a `DataFrame` from your Twitter search object of the people who are tweeting about that topic that, at minimum, contains the unique `screen_names`, `followers_count`, `friends_count`, and `statuses_count` returned from your search. 

There are a number of ways to do this so you might want to review how to construct `DataFrames` (W6D1-Demo). You may also want to review navigating JSON objects (W4D2-Exercise_solutions). Also, your returned data might include the same account multiple times, so you will want to make sure that you are listing the account only once in your `DataFrame`.

In [15]:
import pandas as pd
import numpy as np

In [16]:
names = []
screen_names = []
id_str = []
location = []
description = []
followers_count = []
friends_count = []
statuses_count = []
created_at = []


for tweet in election2020_data:
    if (tweet._json['user']['screen_name'] not in screen_names):
        names.append(tweet._json['user']['name'])
        screen_names.append(tweet._json['user']['screen_name'])
        id_str.append(tweet._json['user']['id'])
        location.append(tweet._json['user']['location'])
        description.append(tweet._json['user']['description'])
        followers_count.append(tweet._json['user']['followers_count'])
        friends_count.append(tweet._json['user']['friends_count'])
        statuses_count.append(tweet._json['user']['statuses_count'])
        created_at.append(tweet._json['user']['created_at'])

In [20]:
election2020_dictionary = {
    "name": names,
    "screen_name": screen_names,
    "id_str": id_str,
    "location": location,
    "description": description,
    "followers_count": followers_count,
    "friends_count": friends_count,
    "statuses_count": statuses_count,
    "created_at": created_at
}

election2020_df = pd.DataFrame(election2020_dictionary)
election2020_df

Unnamed: 0,name,screen_name,id_str,location,description,followers_count,friends_count,statuses_count,created_at
0,📦The Unboxers📦,TheUnboxerrs24,3008049196,United States,"Bringing you the best in toys, games & family ...",4810,1789,45260,Sat Jan 31 17:15:42 +0000 2015
1,Cheri Head,HempChica,1235638306211151875,,"Entrepreneur, CBD advocate, natural health ent...",2,14,4,Thu Mar 05 18:48:40 +0000 2020
2,Gabrielle- H◾️gh ◾️◾️i◾️st◾️◾️◾️ of Redaction,Minervasbard,851205216934522880,,Here To bear witness & record this dark time i...,12748,6287,184058,Sun Apr 09 22:48:33 +0000 2017
3,STEMthebleeding,STEMthebleeding,1227999980654407680,,Medically Retired Marine. Biochemist and math...,138,178,2935,Thu Feb 13 16:56:57 +0000 2020
4,Buddy'sRNLResists!,RuthBuddy1,985972898925039618,"Pennsylvania, USA",#JusticeIsComing #DontLookAway #PAResists #End...,21599,21996,80885,Mon Apr 16 20:07:12 +0000 2018
...,...,...,...,...,...,...,...,...,...
81,David G Zeiler ⚡️,DavidGZeiler,2206472570,"Baltimore, MD","Staff #cryptocurrency expert @MoneyMorning, a ...",3939,2655,8999,Thu Nov 21 05:28:50 +0000 2013
82,Jessica Mahuron,jessica_mahuron,986301147366371328,"Coeur d’Alene, ID",,38,7,19,Tue Apr 17 17:51:32 +0000 2018
83,Nickey,nickey1969,1101841122,"Texas, USA",Married to my life partner of 20 years. 🌈 We h...,3578,4996,187042,Fri Jan 18 19:51:04 +0000 2013
84,Uncle Shamsbros,Merican_Muse,3822351699,United States,U.S. Army Veteran/Chemical Corps. Political = ...,1818,1088,9657,Wed Sep 30 08:11:56 +0000 2015


With our neat `DataFrame` we can now easily find out details about the data we collected from Twitter.

**Exercise 3. Get information about our data.** Use the `print` function and string operations to make Python tell you in plain language: a) How many unique accounts there are in your data, b) what the name of the _last_ account in your data is, and c) what the sum of followers is for all accounts in your data. That is make Python print out full sentences with the relevant information.

In [24]:
print("a) There are " + str(election2020_df.name.count()) + " unique accounts in the data.")
print("b) The last account in my data is called " + str(election2020_df.name[12]))
print("c) The accounts in my data have a total of " + str(sum(election2020_df.followers_count)) + " followers.")

a) There are 86 unique accounts in the data.
b) The last account in my data is called Dasan Buddhika
c) The accounts in my data have a total of 636566 followers.


**Exercise 4. Adding data to our DataFrame.** Loop through the indices of your `DataFrame`, collect the timeline for each account using the `user_timeline` method from tweepy, and store them in a new list "timelines". Note that you will want to build in some `sleep` time to avoid running into rate limits. You can find the syntax for how to do this on page 155 and the logic and examples on pages 209-12 in Brooker (2020).

In [26]:
timelines = []

for item in election2020_df.index:
    statuses = api.user_timeline(election2020_df.screen_name[item])
    timelines.append(statuses)
    time.sleep(5)
    print(item)

NameError: name 'time' is not defined

Add your list "timelines" to your current `DataFame`. To do this, we first need to turn our list into a new `DateFrame` with one column labeled `timelines` and then join our two `DateFrames` horizontally, i.e. along `axis = 1`.

### Take a deep breath. This was a major piece of coding. 

Now you have the timeline, that is the statuses, for each of your accounts in the `DataFrame`. But these are still in the raw format which the Twitter API returns, so we need to transform them into a format that allows us to to work with them more easily. In the end, we want to get at the the topics and persons our accounts tweet about.

**Exercise 5. Getting the tweet texts from the timeline.** Extract the text from the tweets in each account's timeline, combine them into a list, turn the list of lists into a `DataFrame`, and join the new and old `DataFrames`. One way to do this is to 1) create an empty list 'tweets', 2) loop through the indices in your `DataFrame`, 3) for each index/row loop through the timeline, 4) create a temporary list, append the text for each timeline element to that list, then append the temporary list to 'tweets' 5) turn the list into a `DataFrame` and 6) merge the two `DataFrames` horizontally.

**Exercise 6. Turning our list of tweet texts into a long string.** To get a sense of what our accounts usually tweet about, it might be useful to have their tweets in one long string that allows us to easily count the words they use. Create a list that holds the long string of tweets for each user. We can concatenate our list of tweets/strings using the [join](https://docs.python.org/2/library/string.html#string.join) command for which you can find a usage example [here](https://stackoverflow.com/a/493842).

Turn the list into a `DataFrame` and merge it with your main `DataFrame` horizontally. 

**Exercise 7. Finding hashtags and mentions.** Now that we have all the tweets for each account in one long string, we can start looking at the topics the accounts are tweeting about and who they are interacting with. To do so, you can use the [`findall`](https://docs.python.org/3/library/re.html#re.findall) function from the `re` package to extract all hashtags (starting with a "#") and mentions (starting with an "@"). Add one column for hashtags and mentions respectively to your `DataFrame`.

**Exercise 8. Writing your insights to a file.** You have just generated some really awesome insights about the accounts you identified earlier. To share your insights, that is the topics/hashtags your accounts tweet about, you should now write the hashtags to a text file–if you want to remind yourself, we covered this in week 3 day 1. Can you make it so the text file first lists the name and then the hashtags the account uses?

**Exercise 9. Descriptive statistics about your accounts.** We closed last week with talking about descriptive statistics. For the accounts you gathered, there are at least three variables that you might be interested to know more about. What are the minimum, maximum, and mean for the number of followers, friends, and posted statuses in you data?

**Exercise 10. Visualizing influence.** To round off this exercise, let's plot some data from the accounts you collected. Make a bar plot to show which of the accounts has the most influence on Twitter. _Hint:_ You might want to look at `followers_count`.

**Exercise 11. Understanding influence.** Now that you know who is most influential among your accounts, try to see if the data you get from Twitter allows you to explore what might explain that influence. Look into your data and plot the follower count against another variable. Is there a pattern?

**THERE IS ALWAYS MORE.** If you got all the way through this exercise and are still hungry for more, here are some suggestions for other things you could do:

1. To get an even better sense of what your accounts tweet about than just using hashtags, you could count the most used words. Create a list that, for each account has a dictionary of the frequency of each word with stop words removed. Remember, you can reuse your code from W3D1. You can get a list of stop words from [here](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words). These are also stored in `stop_words.txt`. Add a column to your dataframe for most used words. 
2. Extract the number of favorites and retweets from the timelines you gathered. Is there any relationship between the number of followers and these figures? How about between these figures and the number of friends?
3. Researchers often use Twitter because we can do respondent-driven sampling, i.e. we start with a few accounts and then collect the accounts that follow these accounts to get a broader picture of the network. Start exploring the networks of the accounts you collected using the [`followers`](https://tweepy.readthedocs.io/en/latest/api.html#API.followers) command.
4. Given that the accounts you collected are similar in that they tweet about your topic of choice, it might be interesting know if there are issues that distinguish the accounts. Researchers often use term frequency-inverse document frequency to study such differences. [Here](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/) is a primer on the concept and a tutorial on how to implement it in Python. Can you find distinguishes your accounts from one another?