<a href="https://colab.research.google.com/github/meiqingli/dssj_summer2022/blob/main/DSSJ_Final_Project_Li.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Geographic Disparity in Publishing: A Sentiment Analysis from Subreddit Community**
**Meiqing Li | PhD Candidate in City and Regional Planning**

**DSSJ @ Berkeley**

In [17]:
# import packages
import pandas as pd
import spacy
from gensim.models.phrases import Phrases, Phraser

# load the English preprocessing pipeline
nlp = spacy.load("en_core_web_sm")

## Introduction
The subreddit of interest is [Publishing](https://www.reddit.com/r/publishing/), a community where publishing professionals discuss issues such as career and business opportunities. Initial qualitative analysis showed that it is a relatively small community compared with other subreddits, by number of both readers and posts. Topics include different routes to publication, for example self-publishing, publishing through an agent, and digital publishing, as well as publishing in the US versus Canada. These are all potential areas for further investigation.

Before cleaning, our dataset included 7,330 posts and 19,538 comments. Removing blank texts reduced it to 2,388 posts and 19,054 comments, written by 1,922 unique post authors and 4,033 unique comment authors. Interestingly, the most frequent poster accounts for 1% of all posts, while the next most frequent accounts for only 0.3%. We also noticed that people are curious about the differences in publishing between the US and Canada, and that most posts come from people trying to navigate publishing, with very few from the publisher side. For this project, I would like to look into the posts by different people involved in publishing, specifically the themes and sentiments within posts from different areas or regions. For example, how do the sentiments and concerns regarding publishing differ between the US, Canada, and other countries, or between different parts of the US? Also, within each geographic region, what are the shares of posts by aspiring authors, publishers, agents, and other stakeholders? I am interested in these dynamics because they can give us a better understanding of the broader landscape of publishing.

To address these questions, I will look into several NLP techniques, including *topic modeling* and *sentiment analysis*. Specifically, I will use tools such as *concordance* to identify texts containing geographic keywords, then extract topical words and sentiments from those texts.
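As a minimal sketch of the concordance step, the snippet below finds geographic keywords in raw text and returns a window of context around each hit. The function name `concordance`, the sample posts, and the keyword list are hypothetical and only illustrate the idea. Matching is case-sensitive and runs on the raw (not lowercased) text, so that the country abbreviation "US" is not confused with the pronoun "us".

```python
import re

def concordance(texts, keywords, width=30):
    """Yield (keyword, left context, right context) for each keyword hit."""
    # Word boundaries keep "US" from matching inside longer words
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b")
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            yield match.group(0), left, right

# Hypothetical example posts and keyword list, for illustration only
sample_posts = [
    "I live in Canada and want to query agents in the US.",
    "Let us discuss self-publishing options.",
]
geo_keywords = ["US", "Canada", "UK"]

hits = list(concordance(sample_posts, geo_keywords))
for keyword, left, right in hits:
    print(f"{left:>30} [{keyword}] {right}")
```

In the notebook, the same function could be applied to the raw `selftext` or `body` columns before lemmatization.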

In [4]:
# Github url with post data
url_posts = 'https://gist.githubusercontent.com/meiqingli/2d128e1cac170d71b3820c51c6c3766e/raw/5be159378011713f1daaca4d683bb6875d11f61f/submissions.csv'

# Github url with comments data
url_comments = 'https://gist.githubusercontent.com/meiqingli/5d83de4c508a0564359b2dd07c6839b1/raw/a83a44b0a85ac8ccc10f717495ff6a85c2f93878/comments.csv'

# Reads the csv posts file from github
df_posts = pd.read_csv(url_posts)

# Reads the csv comments file from github
df_comments = pd.read_csv(url_comments)

In [5]:
# Tells us about the general shape of the dfs
df_posts.shape

(7330, 18)

In [6]:
df_comments.shape

(19538, 11)

In [7]:
# Allows us to see which columns our dfs have
list(df_posts)

['idint',
 'idstr',
 'created',
 'self',
 'nsfw',
 'author',
 'title',
 'url',
 'selftext',
 'score',
 'subreddit',
 'distinguish',
 'textlen',
 'num_comments',
 'flair_text',
 'flair_css_class',
 'augmented_at',
 'augmented_count']

In [8]:
list(df_comments)

['idint',
 'idstr',
 'created',
 'author',
 'parent',
 'submission',
 'body',
 'score',
 'subreddit',
 'distinguish',
 'textlen']

## Preprocessing
We applied the following preprocessing steps to the original dataset:

*   Drop redundant columns;
*   Remove rows in posts and comments that are either "removed" or "deleted";
*   Drop rows with null values in `selftext` and `body` columns;
*   Drop duplicate posts and comments;
*   Text cleaning using `spaCy`;
*   Phrase modeling with `gensim`;
*   Save preprocessed texts to new data frames for further analysis.

Code from this section is largely based on joint work by Madeline Bossi, Soliver Fusi, Janiya Peters, and me. 

In [9]:
# Drops less useful columns

df_posts_short = df_posts.drop(['subreddit', 'url', 'distinguish', 'flair_text', 'flair_css_class', 'augmented_at', 'augmented_count'], axis=1)
df_comments_short = df_comments.drop(['subreddit', 'distinguish'], axis=1)

In [10]:
# Selects all rows that don't have 'removed' or 'deleted' in certain columns
df_comments_noBlanks = df_comments_short.loc[~df_comments_short['body'].isin(['[removed]', '[deleted]' ]),:]
df_posts_noBlanks = df_posts_short.loc[~df_posts_short['selftext'].isin(['[removed]', '[deleted]' ]),:]

In [11]:
# Drops rows with null values in 'selftext' or 'body' column (assuming we want to analyze 'selftext')
df_posts_noBlanks = df_posts_noBlanks.dropna(subset=['selftext'])
df_comments_noBlanks = df_comments_noBlanks.dropna(subset=['body'])

In [14]:
# drop duplicate posts or comments
df_posts_noBlanks = df_posts_noBlanks.drop_duplicates(subset=['selftext'])
df_comments_noBlanks = df_comments_noBlanks.drop_duplicates(subset=['body'])

In [15]:
# Final number of posts
df_posts_noBlanks.shape

(2385, 11)

In [16]:
# Final number of comments
df_comments_noBlanks.shape

(18754, 9)

Our original dataset has 7,330 posts and 19,538 comments. After preprocessing, it contains 2,385 posts and 18,754 comments: roughly a third of the original posts but about 96% of the original comments, so most of the reduction comes from posts with removed, deleted, empty, or duplicate text.
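The retention rates can be checked with a quick calculation from the shapes reported in the cells above. The dictionary names here are illustrative only.

```python
# Row counts reported by the .shape cells above
original = {"posts": 7330, "comments": 19538}
cleaned = {"posts": 2385, "comments": 18754}

# Share of rows that survive preprocessing
retention = {name: cleaned[name] / original[name] for name in original}

for name, share in retention.items():
    print(f"{name}: kept {cleaned[name]:,} of {original[name]:,} rows ({share:.1%})")
```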

In [18]:
def clean(token):
    """Helper function that specifies whether a token is:
        - punctuation
        - space
        - digit
    """
    return token.is_punct or token.is_space or token.is_digit

def line_read(df, text_col='selftext'):
    """Generator function to read in text from df and get rid of line breaks."""    
    for text in df[text_col]:
        yield text.replace('\n', '')

def preprocess_posts(df, text_col='selftext', allowed_postags=['NOUN', 'ADJ']):
    """Preprocessing function to apply to the posts dataframe."""
    for parsed in nlp.pipe(line_read(df, text_col), batch_size=1000, disable=["ner"]):
        # Gather lowercased, lemmatized tokens
        tokens = [token.lemma_.lower() if token.lemma_ != '-PRON-'
                  else token.lower_ 
                  for token in parsed if not clean(token)]
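        # Drop possessive markers. The `allowed_postags` check compares lowercased
        # lemmas against tag names, so it never matches and no part-of-speech
        # filtering happens here (verbs remain, as the outputs below show).
        tokens = [lemma
                  for lemma in tokens
                  if lemma not in ["'s", "’s", "’"] and lemma not in allowed_postags]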
        # Remove stop words
        tokens = [token for token in tokens if token not in spacy.lang.en.stop_words.STOP_WORDS]
        yield tokens

# thin wrapper for the comments df, whose text lives in the 'body' column;
# it reuses preprocess_posts so that both data frames get identical processing
def preprocess_comments(df, text_col='body', allowed_postags=['NOUN', 'ADJ']):
    """Preprocessing function to apply to the comments dataframe."""
    yield from preprocess_posts(df, text_col=text_col, allowed_postags=allowed_postags)


In [19]:
# creates a list of lists of lemmas in each post
lemmas_posts = [line for line in preprocess_posts(df_posts_noBlanks)]

# creates a list of lists of lemmas in each comment
lemmas_comments = [line for line in preprocess_comments(df_comments_noBlanks)]

In [22]:
# flattens the list of lists into one big list to facilitate counting lemma frequency
flat_lemmas_posts = [item for sublist in lemmas_posts for item in sublist]
# flat_lemmas_posts[:30]

In [21]:
# creates a df to see which lemmas are the most frequent
from collections import Counter
posts_count = Counter(flat_lemmas_posts)

lemma_freq_df_posts = pd.DataFrame.from_dict(posts_count, orient='index').reset_index()
lemma_freq_df_posts = lemma_freq_df_posts.rename(columns={'index':'lemma', 0:'count'})
lemma_freq_df_posts.sort_values('count', ascending=False)[:30]

| index | lemma | count |
|---|---|---|
| 201 | book | 3134 |
| 187 | publish | 1794 |
| 2 | publishing | 1448 |
| 17 | work | 1393 |
| 3 | like | 1283 |
| 120 | know | 1130 |
| 188 | want | 1060 |
| 290 | publisher | 1059 |
| 174 | write | 971 |
| 77 | look | 836 |


In [25]:
# replicates this same process for the comments
flat_lemmas_comments = [item for sublist in lemmas_comments for item in sublist]
# flat_lemmas_comments[:30]

In [24]:
comments_count = Counter(flat_lemmas_comments)

lemma_freq_df_comments = pd.DataFrame.from_dict(comments_count, orient='index').reset_index()
lemma_freq_df_comments = lemma_freq_df_comments.rename(columns={'index':'lemma', 0:'count'})
lemma_freq_df_comments.sort_values('count', ascending=False)[:30]

| index | lemma | count |
|---|---|---|
| 4 | book | 10150 |
| 72 | work | 5754 |
| 322 | publisher | 5155 |
| 6 | publish | 4775 |
| 25 | publishing | 4580 |
| 17 | good | 4248 |
| 161 | like | 4033 |
| 324 | want | 3643 |
| 170 | author | 3405 |
| 940 | agent | 3217 |


In [26]:
# Forming bigrams and trigrams

# Create bigram and trigram models for posts
bigram_posts = Phrases(lemmas_posts, min_count=10, threshold=100)
trigram_posts = Phrases(bigram_posts[lemmas_posts], min_count=10, threshold=50)  
bigram_phraser_posts = Phraser(bigram_posts)
trigram_phraser_posts = Phraser(trigram_posts)

# Form trigrams
trigrams_posts = [trigram_phraser_posts[bigram_phraser_posts[doc]] for doc in lemmas_posts]

# Create bigram and trigram models for comments
bigram_comments = Phrases(lemmas_comments, min_count=10, threshold=100)
trigram_comments = Phrases(bigram_comments[lemmas_comments], min_count=10, threshold=50)  
bigram_phraser_comments = Phraser(bigram_comments)
trigram_phraser_comments = Phraser(trigram_comments)

# Form trigrams
trigrams_comments = [trigram_phraser_comments[bigram_phraser_comments[doc]] for doc in lemmas_comments]



In [27]:
# joins each into a string
trigrams_joined_posts = [' '.join(trigram) for trigram in trigrams_posts]
trigrams_joined_posts[0]

'industry experience publishing like enter field specifically editing usual catch-22 hold true job need experience experience need job find entry_level work prove difficult live baltimore md feel exhaust obvious avenue send resume cover_letter magazine newspaper publishing house etc locate oppose internship course college student tend exclude consideration additionally support work pay obvious sense priority realize reader subreddit shot dark use advice person familiar industry consider send email editor blog read ask similar guidance perceive brazen'

In [28]:
# joins each into a string
trigrams_joined_comments = [' '.join(trigram) for trigram in trigrams_comments]
trigrams_joined_comments[0]

'great tip start market book online publish'

In [31]:
# adds lemmas column to submissions df (with option to make csv)
df_posts_noBlanks.insert(loc=7, column='lemmas', value=trigrams_joined_posts)
df_posts_noBlanks = df_posts_noBlanks[~df_posts_noBlanks['lemmas'].isin([''])]
df_posts_noBlanks.to_csv('submissions_lemmas.csv', index=False)

# adds lemmas column to comments df (with option to make csv)
df_comments_noBlanks.insert(loc=7, column='lemmas', value=trigrams_joined_comments)
df_comments_noBlanks = df_comments_noBlanks[~df_comments_noBlanks['lemmas'].isin([''])]
df_comments_noBlanks.to_csv('comments_lemmas.csv', index=False)

In [None]:


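# Path to the preprocessed post data (the local file saved above; swap in a Github url once uploaded)
url_posts_processed = 'submissions_lemmas.csv'

# Path to the preprocessed comments data (the local file saved above; swap in a Github url once uploaded)
url_comments_processed = 'comments_lemmas.csv'

# Reads the preprocessed posts csv
df_posts_processed = pd.read_csv(url_posts_processed)

# Reads the preprocessed comments csv
df_comments_processed = pd.read_csv(url_comments_processed)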

In [1]:
# compare with the original dataset
# Identify any differences from the quantitative exploration in the Introduction (fewer words, fewer unique posters, etc.).

## Analysis

```
This section is where you get to deploy and discuss your analysis. Address the following points:

Why is this analysis appropriate for the research question?
Include the code for performing the analysis. Note any analytical choices made (for example choosing parameters)
Interpret the results of your analysis
Discuss how the analysis relates to your research question
Address each research question (if you have multiple) in a separate section, and include a subsection (###) for each technique used. Again, the goal is to weave the code and text together, the same way you would weave figures and text together in a traditional research paper. Speaking of figures- try to include at least one visualization of your results (a table, figure, etc) per quantitative analysis.

The balance of quantitative and qualitative techniques will depend a bit on your project, but you should have at least some quantitative and qualitative elements for each research question.

A sample analysis outline is included below:
```

### Analysis 1

```
A sample tf-idf analysis.

Describe tf-idf;
Describe choices made (i.e. parameters of tf-idf);
Include predictions/original expectations for the analysis.
```

In [None]:
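# incorporating time: convert the timestamps in 'created' (seconds) to datetimes
# (assumes the cleaned posts data frame is the one of interest)
datetimes = pd.to_datetime(df_posts_noBlanks['created'], unit='s')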

```
*   Comment on the graph of tf-idf scores. Does it meet expectations?
*   Comment on the highest tf-idf scores.
```



```
Conclude the analysis:

How did this analysis address your research question?
Reflect on any new or unexpected patterns, consider further exploratory analysis into those patterns.
Identify any limitations of the analysis and how they may impact the results.
And then on to the next analysis/question.
```



A very small share of Reddit’s users is often reported to create the vast majority of the site’s content, so it would not be surprising if a few users shaped the discourse of an entire subreddit. Identifying these users would help us understand how discourse in the Publishing subreddit is shaped.
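One simple way to identify such users is to compute each author's share of all posts. The sketch below uses only the standard library. The function name `author_shares` and the sample author list are hypothetical.

```python
from collections import Counter

def author_shares(authors, top_n=3):
    """Return the top_n most frequent authors with their share of all items."""
    counts = Counter(authors)
    total = sum(counts.values())
    return [(author, n / total) for author, n in counts.most_common(top_n)]

# Hypothetical author column, for illustration only
sample_authors = ["alice", "bob", "alice", "carol", "alice", "bob", "dave", "erin"]

shares = author_shares(sample_authors)
for author, share in shares:
    print(f"{author}: {share:.1%} of posts")
```

In the notebook, the input would be a column such as `df_posts_noBlanks['author']`. The pandas call `value_counts(normalize=True)` on that column gives the same shares directly.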

In [None]:
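# Keep the 1,000 highest-scoring posts (assumes `df` is meant to be the cleaned posts data frame)
df = df_posts_noBlanks.sort_values(by='score', ascending=False)[:1000]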
# Sanity check
print(df.shape)

In [None]:
df.author.nunique()

In [None]:
df.plot('score', 'num_comments', kind='scatter', color='black', alpha=0.25, logy=True)

In [None]:
## Rank by geographic areas, map distribution
## Word embedding and topic modeling
## Cluster Analysis
## Sentiment analysis
## Network analysis
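As a rough sketch of how the planned geographic ranking and sentiment analysis could fit together, the snippet below tags tokenized documents with regions and averages a sentiment score per region. Everything named here is an assumption made for illustration: the region keyword sets, the helper functions, and the tiny `TOY_LEXICON`, which merely stands in for a real scorer such as VADER (listed under Tools). Note that lowercased lemmas cannot distinguish "US" from the pronoun "us", which is also a stop word, so the US keyword set deliberately omits it.

```python
from collections import defaultdict

# Hypothetical region keywords and toy sentiment lexicon (stand-ins, not real resources)
REGION_KEYWORDS = {
    "US": {"usa", "american"},
    "Canada": {"canada", "canadian"},
}
TOY_LEXICON = {"great": 1.0, "helpful": 1.0, "difficult": -1.0, "scam": -1.0}

def tag_regions(tokens):
    """Return the set of regions whose keywords appear in the token list."""
    token_set = set(tokens)
    return {region for region, words in REGION_KEYWORDS.items() if words & token_set}

def toy_sentiment(tokens):
    """Average the lexicon scores of matched tokens; 0.0 if nothing matches."""
    scores = [TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def sentiment_by_region(docs):
    """Average document-level sentiment for every region mentioned."""
    scores = defaultdict(list)
    for tokens in docs:
        score = toy_sentiment(tokens)
        for region in tag_regions(tokens):
            scores[region].append(score)
    return {region: sum(vals) / len(vals) for region, vals in scores.items()}

# Hypothetical lemmatized documents, for illustration only
sample_docs = [
    ["canadian", "publisher", "great", "helpful"],
    ["usa", "agent", "difficult"],
    ["american", "publisher", "great", "scam"],
]

result = sentiment_by_region(sample_docs)
print(result)
```

A document that mentions several regions contributes its score to each of them, which is one of several possible design choices. Scoring only the sentences that contain the keyword would be a finer-grained alternative.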

## Conclusion

```
This section is a brief summary of your analyses and final thoughts:

What were your conclusions in response to your research questions?
What are potential implications of these results to the broader community?
Reflect on how this project relates to the themes of the workshop.
How would you further develop this research project?
```

## References
*   Getting Started with Sentiment Analysis using Python. (n.d.). Retrieved July 17, 2022, from https://huggingface.co/blog/sentiment-analysis-python
*   King, G., Pan, J., & Roberts, M. E. (2013). How Censorship in China Allows Government Criticism but Silences Collective Expression. American Political Science Review, 107(2), 326–343. https://doi.org/10.1017/S0003055413000014
*   Lucy, L., Demszky, D., Bromley, P., & Jurafsky, D. (2020). Content Analysis of Textbooks via Natural Language Processing: Findings on Gender, Race, and Ethnicity in Texas U.S. History Textbooks. AERA Open, 6(3), 233285842094031. https://doi.org/10.1177/2332858420940312



**Tools:**
*   [Sentiment analysis](https://github.com/cjhutto/vaderSentiment)
*   [Getting started with sentiment analysis](https://huggingface.co/blog/sentiment-analysis-python)