# Data Analysis for Social Media: A Primer on the Twitter API
**Vagelos Computational Science Center, Barnard College (October 7, 2022)**

Instructor: Jack LaViolette (he/him), PhD student, Columbia Sociology and the Interdisciplinary Center for Innovative Theory and Empirics
- jl5770@columbia.edu
- @jack_laviolette

The first part of this workshop will acquaint you with the basic building blocks of coding and manipulating data in Python:
- **Python fundamentals**
    - strings, integers, floats, Booleans, lists, dictionaries, functions
- **Working with tabular data**
    - the ``pandas`` library and ``DataFrame`` class

Once we have a bit of familiarity exploring data in Python, in the second part we will apply these skills to Twitter data that you will extract yourself:
- **Accessing data from Twitter**
    - Twitter API + ``tweepy``
- **Visualizing data**
    - Scatterplots, histograms, and line plots with ``matplotlib.pyplot``
- **Text analysis**
    - Sentiment analysis and semantic similarity with ``spacy`` and ``word2vec``

There is a **lot** of material here for a two-hour beginner workshop, so please don't worry if you feel overwhelmed! My hope is that you leave feeling that these are skills you could imagine developing further on your own. I chose the Twitter API instead of a static dataset because you can repurpose this code notebook to explore your own questions related to social media data.

## Before we get started...

Run the code cell below to make sure the rest of the notebook will work. 

Once it finishes running, click the "Runtime" tab at the top, and click "Restart runtime".

In [None]:
! pip install -U
! pip install spacy==3.0.6
! pip install spacytextblob==4.0.0
! pip install -U gensim
! pip install pyldavis
! python -m textblob.download_corpora
! python -m spacy download en_core_web_sm

## Part 1a: a very brief introduction to Python basics

There are four fundamental **data types** in Python. These are:

- **Strings**
- **Integers**
- **Floats**
- **Booleans**

There are many other, more complex types of data structures, some of which we will encounter later today, but these are the building blocks. 

You can store these values by assigning them to a **variable** using the ``=`` sign. Variable names cannot contain spaces and cannot start with a number, but can otherwise be anything you want (in general, short and descriptive is good).

``variable_name = variable_value``

### Strings

**"Strings"** are Unicode characters: basically, **text data**. In Python, you denote a string with **quotation marks**. You can use single or double quotes, but you have to be consistent for a given string (i.e., if you start with a single quote, you must end with a single quote):

In [None]:
# define two strings, one with double quotes and one with single quotes. Then print them.
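
If you want something to compare your answer against, here is one possible version of this cell. It's only a sketch: the variable names and the text are made up for illustration.

```python
# two strings: one with double quotes, one with single quotes
greeting = "Hello, Barnard!"
topic = 'social media data'

print(greeting)
print(topic)
```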

You can also "add" strings together with the ``+`` sign; this is called "concatenation":

In [None]:
# define two strings and concatenate them
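
A minimal sketch of concatenation, with made-up variable names. Note that Python joins the strings exactly as written, so any space between them has to be added explicitly.

```python
first = "data"
second = "analysis"

# concatenation does not insert a space for us, so we add one ourselves
combined = first + " " + second
print(combined)
```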

Python strings come with all sorts of **built-in methods** for modifying them. We won't cover many of these, but some simple examples here:

In [None]:
# lowercases the string


# uppercases the string


# replaces any instance of 'cat' to 'dog'
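
One way these three methods might look in practice (the example sentence is invented). Each method returns a *new* string and leaves the original untouched, and ``replace`` is case-sensitive.

```python
sentence = "My Cat chased another cat."

print(sentence.lower())                # all lowercase
print(sentence.upper())                # all uppercase
print(sentence.replace("cat", "dog"))  # case-sensitive: "Cat" is left alone

print(sentence)                        # the original string is unchanged
```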


### Integers

**Integers** are intuitive to use. You can do arithmetic with them as you would expect:

In [2]:
# let's do some math
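
A quick sketch of the arithmetic operators, using two arbitrary numbers. The one surprise for many beginners is that ``/`` always produces a float, even when both inputs are integers.

```python
a = 7
b = 2

print(a + b)   # addition
print(a - b)   # subtraction
print(a * b)   # multiplication
print(a / b)   # division always gives a float, even for two integers
print(a // b)  # "floor" division drops the remainder
print(a % b)   # the remainder itself
print(a ** b)  # exponent (7 squared)
```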

### Floats

**Floats** behave the same way. You can perform arithmetic mixing **floats and integers**.

In [None]:
# combine floats and integers in a single expression
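
For example (the names ``price`` and ``quantity`` are just placeholders), mixing a float and an integer in one expression gives back a float:

```python
price = 2.5     # a float
quantity = 4    # an integer

total = price * quantity
print(total)
print(type(total))   # the result of mixing a float and an integer is a float
```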

### Booleans

**Boolean** variables have values of either **True** or **False** (capitalization required):

In [None]:
# Define two booleans

These can be helpful for filtering data, as we will see later. The **equality operator** ``==`` can be read as "is equal to", and will return a Boolean. The ``!=`` operator reads "is not equal to" and functions similarly. An example below:

In [None]:
# Evaluate whether two variables are equal to one another
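
Here is a small sketch with invented variables. It also shows a detail worth knowing early: a number and a string are never equal, even if they look the same when printed.

```python
is_raining = True
is_sunny = False

x = 5
y = 5.0
z = "5"

print(x == y)   # the integer 5 and the float 5.0 count as equal
print(x == z)   # a number is never equal to a string
print(x != z)   # "is not equal to"
```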

You can also use the ``>`` and ``<`` signs to compare values:

In [None]:
# Evaluate whether two variables are GT/LT to one another
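
A sketch with made-up counts. Each comparison evaluates to a Boolean, which is what will make these useful for filtering later on.

```python
likes = 120
retweets = 45

more_likes = likes > retweets     # True
fewer_likes = likes < retweets    # False
at_least_120 = likes >= 120       # ">=" means "greater than or equal to"

print(more_likes, fewer_likes, at_least_120)
```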

### Lists

Very often, you'll want to **store multiple values in a list**. Lists can contain different types of values (you can mix strings, integers, etc., along with other more complicated data types).

You can create a list with **square brackets**, with elements separated by commas: ``[item1, item2, item3]``



In [None]:
# create a list

To access a specific item in the list, you can reference it by its position in the list (its **'index'**). You do this with a square bracket containing the index number to the right of the list.

**IMPORTANT**: Python is **'zero-indexed,'** which means that the first item in a list is referred to by the index 0 rather than 1.

In [None]:
# access elements of your list with indices
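
As an illustration (the list contents are arbitrary), creating a list and pulling items out by index might look like this:

```python
cities = ["New York", "Chicago", "Los Angeles", "Houston"]

print(cities[0])    # first item (index 0)
print(cities[2])    # third item
print(cities[-1])   # negative indices count from the end
print(len(cities))  # number of items in the list
```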

You can **add new elements** to a list with the ``append`` method:

In [None]:
# append an element to your list
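
A sketch with an invented list. Unlike the string methods earlier, ``append`` changes the existing list rather than returning a new one.

```python
hashtags = ["#nyc", "#python"]

hashtags.append("#data")   # adds the new element to the end, changing the list in place
print(hashtags)
```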

You can also **combine** lists with the ``+`` operator:

In [None]:
# combine two lists

Lists can even contain other lists!

In [None]:
# create a list of lists
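
To see the difference between combining lists and nesting them, here is a small sketch (all names are illustrative):

```python
weekdays = ["Mon", "Tue"]
weekend = ["Sat", "Sun"]

all_days = weekdays + weekend      # combining makes one longer, "flat" list
nested = [weekdays, weekend]       # a list of lists keeps the two groups separate

print(all_days)
print(nested)
print(nested[1][0])                # first item of the second inner list
```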

### For loops

When working with lists or other sequences of values, you'll often want to **iterate** over them and perform some **operation** on each element of the list. A very common way of achieving this is called a **for loop**:

In [None]:
# write a for loop that does something to each element of a list
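
One possible loop, as a sketch with made-up data. It shows a pattern we will reuse later with the Twitter data: start with an empty list, then append one result per element.

```python
words = ["twitter", "data", "python"]
lengths = []

for word in words:
    print(word.upper())          # do something with each element...
    lengths.append(len(word))    # ...and/or collect a result for each one

print(lengths)
```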


### Dictionaries

Finally, you'll often want to store **pairs of values** (for example, user IDs and passwords). The simplest and most common way to achieve this is with a **dictionary**. The two elements of the pair are called the **"key" and the "value"**. Dictionaries are denoted by **curly brackets**: { }

Within the curly brackets, you can set key value pairs like this:

``{key1:value1, key2:value2, key3:value3}``

You then pass the dictionary a key and it will return the corresponding value, the same way you get elements from lists with the index.

In [None]:
# create a dictionary; make it meaningful (e.g., countries and capitals, first and last names, etc.)

You can also add items to a dictionary as such:

In [None]:
# add an entry to your dictionary
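
Putting the last two cells together, a dictionary of countries and capitals might look like the sketch below (the entries are just examples):

```python
capitals = {"France": "Paris", "Japan": "Tokyo"}

print(capitals["Japan"])        # look up a value by its key

capitals["Kenya"] = "Nairobi"   # assigning to a new key adds an entry
print(capitals)
```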

### Defining custom functions

Often, you will want to repeatedly apply the same sequence of commands to different data. An efficient way to do this can be to define a **custom function**, which takes an **input** or multiple inputs, does something to it, and returns an **output**. The format to define a function is this:

``def function_name(function_input1, function_input2...etc):
      do some stuff 
      return output``

In [None]:
# write a function that takes a number as an input and returns the cube of that number

def cube(number):
    return number ** 3

# write a function that takes a number as an input and returns half of that number

def halve(number):
    return number / 2


You can also **chain functions together** like this, using the output of one as the input of the next. In this case, we are using the output of our function ``halve`` as the input of our function ``cube``:

In [None]:
cube(halve(4))

Great, you've learned some of the foundational aspects of coding in Python! These might all seem very basic, but you'd be surprised what you can accomplish with just the above tools.

## Part 1b: Tabular data with the ``pandas`` library

Everything we've done so far has been in **"base Python."** These features are all included when you download any version of Python.

Very often, however, you'll want to use **"libraries" or "packages,"** collections of code other people have written to accomplish certain tasks. Some are quite general and very widely used, while others are tailored to extremely specific tasks. 

One of the most widely used Python libraries is called ``pandas``, which we will use to work with **tabular data** (think, a spreadsheet).

To access a library, you'll have to ``import`` it. When importing, you can import a package ``as`` a shorthand name for ease of reference. It is convention in Python to import ``pandas`` as ``pd``:

In [None]:
import pandas as pd

The main ``pandas`` data type we will be using is called a ``DataFrame``: basically, a table of data.

There are many, many ways to create ``DataFrames`` with ``pandas``. For example, you could read in a .csv file you have stored on your computer, read an HTML table from a website, create them from lists and/or dictionaries, and so on. Later we will be creating a DataFrame from a list of dictionaries. 
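
To make the "list of dictionaries" route concrete, here is a minimal sketch using a tiny made-up table (the movie titles and column names are invented for illustration, not taken from a real dataset):

```python
import pandas as pd

# each dictionary is one row; the keys become column names
rows = [
    {"title": "Movie A", "duration": 120, "genre": "Drama"},
    {"title": "Movie B", "duration": 95, "genre": "Comedy"},
    {"title": "Movie C", "duration": 150, "genre": "Drama"},
]

movies = pd.DataFrame(rows)
print(movies)
print(movies.shape)   # (number of rows, number of columns)
```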

One of the simplest ways to create a ``DataFrame`` is to read in a .csv file, whether stored locally on your computer or at a URL. Here's a URL of a .csv file containing data from 979 IMDB pages. We can read it into a ``DataFrame`` with the function ``pd.read_csv()``:

In [None]:
imdb = "https://raw.githubusercontent.com/justmarkham/DAT7/master/data/imdb_1000.csv"

# read the above url into a dataframe with pd.read_csv()


We've made a DataFrame! We can look at the top N rows of it with the ``head()`` function (default is 5 rows):

In [None]:
# examine the top n rows

We can check its dimensions with ``.shape``:

In [None]:
# check dimensions

We can reference **columns** as such:

In [None]:
# look at a column

We can access the value in a given cell a few ways. A common one looks like this:

``dataframe[column_name][index]``:

In [None]:
# look at a specific value within a column

We can filter by column values like this:

In [None]:
# filter a column one way

In [None]:
# filter a column another way
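
Here are two filtering styles you might try, sketched on a small invented table so the cell runs on its own (swap in the real dataframe and column names when you do this yourself). Both approaches keep only the rows where a condition is true.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "duration": [120, 95, 150],
    "genre": ["Drama", "Comedy", "Drama"],
})

# way 1: build a column of True/False values, then use it to select rows
is_drama = movies["genre"] == "Drama"
dramas = movies[is_drama]

# way 2: the .query() method takes the condition as a string
short_movies = movies.query("duration < 130")

print(dramas)
print(short_movies)
```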

Finally, we can manipulate data to create new columns. This is particularly easy when the columns contain numbers only. For example, we could convert minutes into hours by dividing by 60:

In [None]:
# create a new column for duration in hours
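
A sketch of the same idea on toy data (the column name ``duration_hours`` is my own choice). Dividing a whole column by a number applies the division to every row at once, so no loop is needed.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "duration": [120, 90, 150],   # minutes
})

# arithmetic on a column is applied to every row
movies["duration_hours"] = movies["duration"] / 60
print(movies)
```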



## Part 2: Accessing and analyzing Twitter data via the Twitter API

### What's an API?

Many websites and social media platforms have "application programming interfaces" (APIs) which allow you to access their data in formats which are useful for coding, such as JSON or CSV. For many platforms, including Twitter, there are Python libraries (and/or JavaScript, R, etc.), some official and some community-maintained, designed to help you easily access the API from your own scripts. We will be using Tweepy, an open-source, community-maintained library for accessing the Twitter API via Python.

When you open Twitter and make a search, you're sending a request (the text of your search), and it's sending you back data (tweets). APIs are doing the same thing, but the data is going to look different. Take a look at [this link](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) to see a tweet as it would appear in your browser or on your phone, versus how that same tweet looks when you're requesting data from the API.

### Getting access

Most APIs will require you to create "access keys," "authentication tokens," or something similar in order to use them. These can essentially be thought of as usernames and passwords. When you "ask" the API for data, you include these keys/tokens/passwords in the data request so that they can keep track of who is requesting what data, and to prevent unauthorized users from requesting data on your behalf.

As a first step, we will all need to acquire our own API keys. Instructions to do so are below.

1. If you don't have a Twitter account, you will need to create one. Since this is for demo purposes only, feel free to create a "throwaway" style account. If you already have a Twitter account, make sure you are logged in.
2. Click "More" on the left, and go to Account Information. Click "Phone" and add your phone number if you haven't already. Twitter requires a registered phone number to get API keys.

Now we need to create a developer account associated with the Twitter account. Follow steps 1 and 2 from this Twitter page (summarized below).

https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2

3. Go to https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api
4. Scroll down to "Twitter API access levels and versions" section
5. In the table, click "Sign up" underneath "Essential".
6. Follow instructions to create your developer account.
7. Once you create your app, copy your bearer token into this notebook as a string variable




In [None]:
BEARER_TOKEN = "<YOUR_TOKEN_HERE>"

**NOTE: Storing access keys inside a notebook like this is BAD PRACTICE!** In the real world, you would always want to store these somewhere outside of your code (e.g., in a .txt file), and then read them into your coding environment. That way you can share your code without also sharing your access keys. 
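
For when you move beyond this demo, here is a rough sketch of two common ways to keep a token out of your code. The environment variable name ``TWITTER_BEARER_TOKEN``, the file name ``bearer_token.txt``, and the helper ``load_token`` are all hypothetical choices of mine, not anything Twitter or Tweepy requires.

```python
import os

def load_token(path):
    """Read a token from a plain-text file, stripping surrounding whitespace."""
    with open(path) as f:
        return f.read().strip()

# Option 1: read from an environment variable (hypothetical name);
# .get() returns None instead of raising an error if the variable isn't set
token_from_env = os.environ.get("TWITTER_BEARER_TOKEN")

# Option 2: read from a file kept outside the notebook (hypothetical file name), e.g.
# BEARER_TOKEN = load_token("bearer_token.txt")
```

Either way, the token itself never appears in the notebook, so you can share the notebook freely.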

Now that we've assigned a variable with our bearer token, we can set up a Tweepy client and configure it so that it sends our bearer token with each request.

First, we have to import ``tweepy``:

In [None]:
import tweepy

Then we set up a client to manage our bearer token: 

In [None]:
client = tweepy.Client(bearer_token=BEARER_TOKEN)

Now we can start getting Twitter data! The code below is complicated so I've already typed it out, but the key for now is to understand the ``my_query`` variable. This is a specially formatted string where we tell the API what sort of tweets we want. There are many different ways to do this.

My query is doing the following:

- Tweets containing the keywords "new" and "york" (to match the exact phrase, you would wrap it in double quotes inside the query string)
- Filtering out retweets
- English only
- Verified accounts only

The query syntax can be used to make very complicated requests; more info about building queries can be found here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query
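
If it helps to see how the pieces of a query fit together, here is a small sketch that assembles the same query string from its parts. The variable names are mine; the operators are the ones described in the list above.

```python
keywords = "new york"           # words that must appear in the tweet
filters = [
    "-is:retweet",              # the minus sign negates: exclude retweets
    "lang:en",                  # English-language tweets only
    "is:verified",              # verified accounts only
]

# the parts are separated by spaces in the final query string
example_query = " ".join([keywords] + filters)
print(example_query)
```

Building the query from a list like this makes it easy to comment out one filter at a time and see how the results change.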

The code below is a little complicated, and it's ok if you don't fully understand how it works. The Twitter API only allows you to get 100 tweets per request with the ``search_recent_tweets()`` function, but 100 tweets isn't very many. So we're going to combine that with the ``Paginator`` to easily make multiple of these requests. The total number of tweets this function will return is ``max_results * limit``:

In [None]:
# Replace with your own search query

my_query = 'new york -is:retweet lang:en is:verified'

tweets = tweepy.Paginator(client.search_recent_tweets, 
                              query=my_query, # here's where we pass our query string above
                              tweet_fields=['created_at', 'author_id', "possibly_sensitive", "public_metrics"], # here we tell the API which data fields we want
                              max_results=100, # the number of tweets per request (max 100)
                              limit=100).flatten() # 100 * 100 = 10,000 tweets returned


df = pd.DataFrame(tweets)

Let's take a look!

In [None]:
# sample from your dataframe




First, let's filter out sensitive content:

In [None]:
# remove sensitive content

You'll notice that rather than usernames, the API returned ``author_id``, a number corresponding to a username. We can look up the username with ``client.get_user(id=author_id)``:

In [None]:
# pick an author ID and examine who it belongs to

author_id = df['author_id'].iloc[0] # for example, the author of the first row; swap in any ID you like

client.get_user(id=author_id)

Take a look at the ``public_metrics`` column. You'll notice that each value is actually an entire dictionary, containing the number of retweets, likes, quotes, and replies.

In [None]:
# access a value in the public_metrics column

What we're going to do here will combine our knowledge of lists, for loops, dictionaries and dataframes. We're going to iterate over each entry in ``df["public_metrics"]``, extract the values for retweets and likes, add those values to respective lists, and then add those lists to our dataframe as new columns.

*Note: There are faster and more 'pythonic' ways of doing this, but this is fine for now :)*

In [None]:
# create a for loop where you extract the values of likes and retweets 
# from each element of public_metrics and make new columns
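
Here is one way the loop could go, sketched on a tiny stand-in dataframe so that it runs without API access. The dictionary keys follow the ``public_metrics`` format returned by the API; the name ``df_demo`` and the new column names ``rts`` and ``likes`` are my own choices (``rts`` matches the column used later in this notebook).

```python
import pandas as pd

# toy stand-in for the tweet dataframe: each public_metrics value is a dictionary
df_demo = pd.DataFrame({
    "text": ["first tweet", "second tweet"],
    "public_metrics": [
        {"retweet_count": 3, "reply_count": 0, "like_count": 10, "quote_count": 1},
        {"retweet_count": 0, "reply_count": 2, "like_count": 4, "quote_count": 0},
    ],
})

rts = []
likes = []

for metrics in df_demo["public_metrics"]:
    rts.append(metrics["retweet_count"])   # look up each value by its key
    likes.append(metrics["like_count"])

# the lists are in the same order as the rows, so they can become columns
df_demo["rts"] = rts
df_demo["likes"] = likes
print(df_demo[["text", "rts", "likes"]])
```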



In [None]:
# check your work



The top rows have a lot of zeros for likes/retweets, since they're the most recent. Let's sort the whole dataframe by retweets with the ``.sort_values()`` method:

In [None]:
# sort your dataframe

### Plotting

Now that we have some data, we can start to produce basic visualizations of its structure. It's often good to look at your data with plots to get a sense of its structure before going into more complicated data representations such as regression models.

#### Univariate plots

When working with social media datasets, you'll often find that users have very unequal rates of participation, with some almost never commenting/posting/liking/etc., and some doing so very often. Looking at the **distribution** of participation rates is a common first step in analyzing online user behavior.

Let's start by visualizing the distribution of tweets per account in our data. To do this, we will produce a **histogram**, which is an extremely common type of plot for visualizing **univariate** (i.e., single variable) distributions. Since we are only looking at the distribution of a single variable---tweets per author---a histogram is a good choice.

In order to plot this, we'll need a list of every unique author in the dataset, and the number of times they appear. Fortunately, ``DataFrames`` have a built-in method for doing this, ``.value_counts()``:

In [None]:
df['author_id'].value_counts()

Those values on the right-hand side are the values we want to plot. To do so, let's first import our plotting library:

In [None]:
import matplotlib.pyplot as plt

Now we can plot! Here's the code for plotting a histogram.

We're going to start each plotting cell with ``plt.figure(figsize=(10,10))`` to set the size of the plot (the default size is really small), and end every plotting cell with ``plt.show()`` to show the plot. What goes in between will determine the type of plot and the data plotted, along with settings such as axis labels, titles, aesthetic options, etc.

*Note: a full list of colors for ``plt`` can be found here: https://matplotlib.org/stable/gallery/color/named_colors.html*

In [None]:
plt.figure(figsize=(10,10))

# make a histogram of tweet volume by user



plt.show()
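
For reference, here is a sketch of what a finished histogram cell could look like. It uses invented author IDs so it runs without Twitter data; the bin count, color, and labels are arbitrary choices you should feel free to change.

```python
import pandas as pd
import matplotlib.pyplot as plt

# toy author IDs: one very active account and several that tweet once or twice
author_ids = pd.Series([101] * 8 + [102] * 2 + [103, 104, 105, 106])
tweets_per_author = author_ids.value_counts()

plt.figure(figsize=(10,10))

plt.hist(tweets_per_author, bins=8, color="steelblue")
plt.xlabel("Tweets per author")
plt.ylabel("Number of authors")
plt.title("Distribution of tweets per author (toy data)")

plt.show()
```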

This type of long-tailed distribution is very common in social media datasets. Generally, there are a relatively small number of accounts (the "long tail") that participate very frequently, and a very large number of accounts that participate very little. In fact, it would be surprising to find that a social media dataset *didn't* show this type of distribution.

Another, non-visual way to look at the distribution of a single variable in your data is to use ``.describe()``:


In [None]:
df['rts'].describe()

#### Bivariate plots

What if we wanted to plot the relationship of one variable to another in our dataset? A common way to do this would be to use a scatterplot. The code will be pretty similar, except instead of plotting the output of a ``.value_counts()``, we're going to plot two columns of data.

In [None]:
plt.figure(figsize=(10,10))

# make a scatter plot showing the relationship between likes and retweets




plt.show()

As you might expect, the above plot shows a strong, relatively linear relationship between retweets and likes. Tweets that get a lot of one are likely to have a lot of the other as well. 

Finally, what if we wanted to look at the temporal distribution of tweets in our data? Do people tweet more at certain times of day? We can look at the ``created_at`` field to see when a tweet was posted. This is stored in a ``pandas`` format called a ``Timestamp``, which makes working with time data very easy.

In [None]:
t = df['created_at'][0]

t.hour

Let's pull out the hour of each tweet and create a separate column:

In [None]:
hours = []

for i in df['created_at']:
    
    hours.append(i.hour) # pull the hour (0-23) out of each timestamp
    
    
df['hour'] = hours

In [None]:
hours_counts = df['hour'].value_counts()
hours_counts = hours_counts.sort_index() # have to sort the index 

In [None]:
plt.figure(figsize=(10,10))

# plot the temporal (24-hour) distribution of tweets



plt.show()

## Part 2b: Basic text analysis

Finally, we can explore the text data. Unlike counts of tweets or ratios of likes to retweets, text data is pretty "unstructured." It takes more effort to turn text into something we can plot or visualize than other sorts of data. Working with unstructured text data is often referred to as "natural language processing," or NLP. It is a massive field at the intersection of data science, computer science, and linguistics. We're just scratching the surface here.

Fortunately, there are a lot of amazing libraries for text analysis (and other machine learning tasks). We're going to be using ``spacy``, ``textblob``, and ``gensim``. These were installed by the setup cell at the top of the notebook; if you skipped that cell, go back and run it first (it may take a minute or two).

Then import the packages (this might take a minute):

In [None]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
import gensim # used for the word2vec model later on

#### Sentiment analysis with ``spacy``

``spacy`` makes a lot of complicated natural language processing tasks really easy. However, it uses its own fairly unusual syntax. I've included comments that try to describe each step, but don't be surprised if you feel confused about what's going on here.

As an example, let's try sentiment analysis. We're going to use a pretrained machine learning model to predict the sentiment (how positive or negative) of a tweet from the text data alone. The below code does that in only a few lines:

In [None]:
nlp = spacy.load('en_core_web_sm') # here we are loading a pretrained machine learning model for handling text data

nlp.add_pipe('spacytextblob') # here we're adding a "pipe" to the model; 
                              # basically, an additional step in between input and output, 
                              # which is "switched off" by default.   

# let's make a list to store the sentiment (aka polarity) scores and a list for the subjectivity scores:

pols = []
subs = []
        
# Now we're going to iterate over all the tweets:

for i in df['text']:
    
    doc = nlp(i) # here we're creating a variable doc which is the output of passing our text through the nlp model
                    # from here we can extract all the information the model extracted from the unstructured text    
    polarity = doc._.blob.polarity # here we pull out the polarity (sentiment)    
    pols.append(polarity)
    subs.append(doc._.blob.subjectivity) # and here the subjectivity score

    
df['polarity'] = pols
df['subjectivity'] = subs

In [None]:
plt.figure(figsize=(10,10))

# make a histogram of polarity scores

plt.show()

#### Semantic similarity with ``word2vec`` and Yelp reviews

A common task in NLP is **"semantic representation"**: finding ways of representing the "meaning" of a word, sentence, or document. Of course, computers don't understand meaning. But they can learn from patterns in really large corpora of natural language, and use the information in those patterns to try to predict which words, sentences, and documents are "similar" to each other in other collections of text.

To do this, they use methods of converting unstructured text data into a **vector** representation. By vector representation, we mean turning each unit (word, sentence, or document) into a sequence of numbers.

The simplest, and original, way of doing this is called a **"document-term matrix."** Imagine a table where each row represents a document, and each column represents a unique word in the collection of documents. The value in a given cell would represent the number of times the column word appears in the row document.

Imagine a really simple corpus of three documents:
- Doc 1: "I like dogs"
- Doc 2: "I dislike dogs"
- Doc 3: "I like cats"

The document-term matrix would look like this:

|             | I | like | dislike | dogs | cats | 
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Doc 1      | 1       | 1       | 0       | 1       | 0       |
| Doc 2   | 1        | 0       | 1       | 1       | 0       |
| Doc 3   | 1        | 1       | 0       | 0       | 1       |

For years, NLP researchers have shown that you can use these sorts of matrices as the starting point for all sorts of representations of how these documents relate to one another. You can imagine these sequences of numbers for each document as coordinates in high-dimensional space; if two documents are closer to each other in that space, they are likely to have similar meanings.
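
To make this concrete, here is a sketch that rebuilds the table above in plain Python and then compares the documents. I'm using cosine similarity as the measure of "closeness"; that is one common choice among several, and the function below is my own minimal version rather than a library call.

```python
import math
from collections import Counter

docs = {
    "Doc 1": "I like dogs",
    "Doc 2": "I dislike dogs",
    "Doc 3": "I like cats",
}

vocab = ["I", "like", "dislike", "dogs", "cats"]   # one column per unique word

# one row per document: how many times each vocabulary word appears
dtm = {}
for name, text in docs.items():
    counts = Counter(text.split())
    dtm[name] = [counts[word] for word in vocab]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for name, row in dtm.items():
    print(name, row)

sim_1_2 = cosine_similarity(dtm["Doc 1"], dtm["Doc 2"])  # these share "I" and "dogs"
sim_2_3 = cosine_similarity(dtm["Doc 2"], dtm["Doc 3"])  # these share only "I"
print(sim_1_2, sim_2_3)
```

Notice the limitation: by word counts alone, "I like dogs" and "I dislike dogs" look quite similar even though they mean opposite things. That weakness is part of the motivation for the more sophisticated models described next.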

More recently, rather than using simple word-count information, very complicated machine learning models are used to take unstructured data and represent them as vectors. These models are called **word embedding models** and are widespread in NLP research, including applied NLP. We're going to be using one called ``word2vec`` to examine which words are most similar to one another in a corpus of Yelp reviews.

In [None]:
yelp_url = "https://raw.githubusercontent.com/justmarkham/DAT7/master/data/yelp.csv"

In [None]:
# read the data in the above url into a dataframe

yelp = 

Before we throw our language into word2vec, we need to modify it to make sure it's in the correct format to act as an input to the model. We need to do two things: **clean** the text, and **tokenize** it.

In [None]:
# clean text

processed_texts = []

# lowercase

yelp['text'] = yelp['text'].str.lower()

# remove punctuation

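punc = r'''!()-[]{};:“'"\,<>./?@#$%^&*_~—''' # raw string, so the backslash is kept as a literal character

for p in punc:
    
    yelp['text'] = yelp['text'].str.replace(p, "", regex=False) # regex=False: treat each character literally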



# remove stopwords

stopwords = list(nlp.Defaults.stop_words) # we can get a list of stopwords from the spacy nlp model 

for i in yelp['text']:
    
    tokens = i.split() # split the text string into a list of words
    
    tokens = [t for t in tokens if t not in stopwords]
            
    processed_texts.append(tokens)
             
processed_texts[0]

We can now pass these tokenized documents to the word2vec model.

In [None]:
model = gensim.models.Word2Vec(processed_texts)

We can do a lot of stuff with this model, but one thing is word-level semantic similarity. We can use the ``most_similar()`` method to input a word and see which other words are most similar to it in our corpus:

In [None]:
model.wv.most_similar('waiter', topn=20) 

### The end

Some cool social science/digital humanities papers that use word embeddings:

- [A Framework for the Computational Linguistic Analysis of Dehumanization](https://www.frontiersin.org/articles/10.3389/frai.2020.00055/full)
    - Authors look at the word embeddings of LGBTQ-related terms in the NYT over multiple decades, showing how their semantic associations changed as the LGBTQ community became less stigmatized

- [The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://journals.sagepub.com/doi/full/10.1177/0003122419877135)

    - Authors examine text of millions of books over 100 years, show how gender and class associations have shifted
    
- [Leveraging the alignment between machine learning and intersectionality: Using word embeddings to measure intersectional experiences of the nineteenth century U.S. South](https://www.sciencedirect.com/science/article/pii/S0304422X21000115)
    - Author analyzes narratives from the early 19th-century U.S. South and uses intersectionality theory as a framework to analyze embedding spaces