<a href="https://colab.research.google.com/github/meiqingli/dssj_summer2022/blob/main/DSSJ_Final_Project_Li.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Geographic Disparity in Publishing: A Sentiment Analysis of a Subreddit Community**
**Meiqing Li | PhD Candidate in City and Regional Planning**

**DSSJ @ Berkeley**

In [1]:
# import packages
import pandas as pd
import spacy

# load the English preprocessing pipeline
nlp = spacy.load("en_core_web_sm")

## Introduction
The subreddit of interest is [Publishing](https://www.reddit.com/r/publishing/), a community where publishing professionals discuss issues such as career and business opportunities. An initial qualitative analysis showed that it is a relatively small community compared with other subreddits, by number of both readers and posts. The topics include different types of publishing (for example self-publishing, publishing through an agent, and digital publishing) as well as publishing in the US vs. Canada. These are all potential areas for further investigation.

Before cleaning, our dataset included 7,330 posts and 19,538 comments. Removing blank texts reduced it to 2,388 posts and 19,054 comments. The posts come from 1,922 unique users and the comments from 4,033 unique users. Interestingly, the most frequent poster accounts for 1% of all posts, while the next most frequent accounts for only 0.3%. We also noticed that people are curious about the differences in publishing between the US and Canada, and that most posts come from people trying to navigate publishing, with very few from the publisher side.

For this project, I would like to look into the posts by different people involved in publishing, specifically the themes and sentiments within posts from different areas or regions. For example, how do the sentiments and concerns regarding publishing differ between the US, Canada, and other countries, or between different parts of the US? Also, within each geographic region, what share of posts comes from people interested in publishing, publishers, agents, and other stakeholders? I am interested in studying these dynamics because they can give us a better understanding of the broader landscape of publishing.
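The cleaning step and user counts described above can be sketched roughly as follows. This is a minimal illustration on an invented toy table, not the code that produced the reported numbers: the column names `author` and `selftext` come from `df_posts`, but the helper `drop_blank_text` is hypothetical, and treating "blank" as missing or whitespace-only text is an assumption.

```python
import pandas as pd

# Toy stand-in for the posts table. `author` and `selftext` are real column
# names from df_posts, but these rows are invented for illustration.
toy_posts = pd.DataFrame({
    "author": ["ann", "ann", "bob", "cara", "dev", "ann"],
    "selftext": [
        "How do I query an agent?",
        None,
        "   ",
        "Self-publishing tips?",
        "",
        "Royalties in Canada",
    ],
})


def drop_blank_text(df, text_col):
    """Keep only rows whose text column is neither missing nor whitespace-only.

    Assumption: this is what "blank" means; the original filter may differ.
    """
    text = df[text_col].fillna("").astype(str).str.strip()
    return df[text != ""].copy()


clean_posts = drop_blank_text(toy_posts, "selftext")

# Number of distinct posters and each poster's share of the remaining posts.
n_unique_authors = clean_posts["author"].nunique()
author_share = clean_posts["author"].value_counts(normalize=True)

print(len(toy_posts), "->", len(clean_posts), "posts after dropping blanks")
print("unique authors:", n_unique_authors)
print(author_share.head(2))
```

The same helper could be applied to `df_comments` with `text_col="body"`. Whether placeholder strings such as deleted-post markers should also count as blank is a separate decision that this sketch does not make.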

To address these questions, I will use several NLP techniques, including *topic modeling* and *sentiment analysis*. Specifically, I will use tools such as *concordance* to identify texts containing geographic keywords, then extract topical words and sentiments from them.
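As a rough sketch of the concordance idea, the snippet below tags a text with the regions it mentions and prints each keyword with some surrounding context. The patterns in `REGION_PATTERNS` and the helpers `concordance` and `tag_regions` are hypothetical; the actual keyword lists for this project have not been decided, and a real analysis would need a much richer set of place names.

```python
import re

# Hypothetical region patterns, for illustration only. The US pattern is
# case-sensitive so that the pronoun "us" is not counted as the country.
REGION_PATTERNS = {
    "US": re.compile(r"\b(?:US|USA|United States|American)\b"),
    "Canada": re.compile(r"\bCanad(?:a|ian)s?\b", re.IGNORECASE),
}


def concordance(text, pattern, width=30):
    """Return each match with up to `width` characters of context per side."""
    lines = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        lines.append(text[start:end].replace("\n", " "))
    return lines


def tag_regions(text):
    """Return the sorted list of regions whose pattern appears in the text."""
    return sorted(
        region
        for region, pattern in REGION_PATTERNS.items()
        if pattern.search(text)
    )


sample = (
    "Is it harder to find an agent in Canada than in the US? "
    "Asking as a Canadian writer."
)
print(tag_regions(sample))
for line in concordance(sample, REGION_PATTERNS["Canada"], width=15):
    print(line)
```

Texts tagged this way could then be grouped by region before topical words and sentiment scores are extracted, for example with the VADER tool listed under References.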

In [2]:
# GitHub Gist URL for the comments data
url_comments = 'https://gist.githubusercontent.com/meiqingli/5d83de4c508a0564359b2dd07c6839b1/raw/a83a44b0a85ac8ccc10f717495ff6a85c2f93878/comments.csv'

# GitHub Gist URL for the posts data
url_posts = 'https://gist.githubusercontent.com/meiqingli/2d128e1cac170d71b3820c51c6c3766e/raw/5be159378011713f1daaca4d683bb6875d11f61f/submissions.csv'

# Read the comments CSV into a DataFrame
df_comments = pd.read_csv(url_comments)

# Read the posts CSV into a DataFrame
df_posts = pd.read_csv(url_posts)

In [3]:
# Tells us about the general shape of the dfs
df_posts.shape

(7330, 18)

In [4]:
df_comments.shape

(19538, 11)

In [5]:
# Allows us to see which columns our dfs have
list(df_posts)

['idint',
 'idstr',
 'created',
 'self',
 'nsfw',
 'author',
 'title',
 'url',
 'selftext',
 'score',
 'subreddit',
 'distinguish',
 'textlen',
 'num_comments',
 'flair_text',
 'flair_css_class',
 'augmented_at',
 'augmented_count']

In [6]:
list(df_comments)

['idint',
 'idstr',
 'created',
 'author',
 'parent',
 'submission',
 'body',
 'score',
 'subreddit',
 'distinguish',
 'textlen']

## Preprocessing


```
In this section, you will address:

- Major decisions in preprocessing (subsetting, any steps additional to the preprocessing recipe in the module).
- Include the code necessary to preprocess your data.
- Save the results! (For faster computation next time)

This section will address the choices you made in preprocessing the data. You can reference the preprocessing recipe from Module 2, and note any deviations/adjustments you made to the recipe and the reason for those adjustments. Include the code to preprocess your data.

If you removed any section of the data, report what proportion of the data set was excluded and why.

Identify any differences from the quantitative exploration in the Introduction (fewer words, fewer unique posters, etc).
```

In [None]:
# developing
# Code from this section is largely based on joint work by Madeline Bossi, Soliver Fusi, Janiya Peters, and me. 

## Analysis

```
This section is where you get to deploy and discuss your analysis. Address the following points:

Why is this analysis appropriate for the research question?
Include the code for performing the analysis. Note any analytical choices made (for example choosing parameters)
Interpret the results of your analysis
Discuss how the analysis relates to your research question
Address each research question (if you have multiple) in a separate section, and include a subsection (###) for each technique used. Again, the goal is to weave the code and text together, the same way you would weave figures and text together in a traditional research paper. Speaking of figures: try to include at least one visualization of your results (a table, figure, etc) per quantitative analysis.

The balance of quantitative and qualitative techniques will depend a bit on your project, but you should have at least some quantitative and qualitative elements for each research question.

A sample analysis outline is included below:
```

### Analysis 1

```
A sample tf-idf analysis.

Describe tf-idf;
Describe choices made (i.e. parameters of tf-idf);
Include predictions/original expectations for the analysis.
```
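For reference when this analysis is written up, here is a minimal sketch of one textbook tf-idf variant, with term frequency as a term's share of a document's tokens and $\mathrm{idf}(t) = \log(N / \mathrm{df}(t))$. The function `tf_idf`, the whitespace tokenization, and the toy documents are illustrative assumptions. Library implementations often use different weighting; scikit-learn's `TfidfVectorizer`, for example, smooths the idf by default, so scores will not match this sketch exactly.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term in each document as tf * idf.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        if not tokens:
            # An empty document has no terms to score.
            scores.append({})
            continue
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Invented toy documents, not taken from the subreddit data.
docs = [
    "publishing agent query",
    "publishing royalties contract",
    "publishing agent contract",
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
print(top_term, round(scores[0][top_term], 3))
```

A term that occurs in every document, like "publishing" here, gets an idf of zero, which is why tf-idf tends to surface terms that distinguish one document (or one region's posts) from the rest.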

```
*   Comment on the graph of tf-idf scores. Does it meet expectations?
*   Comment on the highest tf-idf scores.
```



```
Conclude the analysis:

How did this analysis address your research question?
Reflect on any new or unexpected patterns, consider further exploratory analysis into those patterns.
Identify any limitations of the analysis and how they may impact the results.
And then on to the next analysis/question.
```



## Conclusion

```
This section is a brief summary of your analyses and final thoughts:

What were your conclusions in response to your research questions?
What are potential implications of these results to the broader community?
Reflect on how this project relates to the themes of the workshop.
How would you further develop this research project?
```

## References
*   King, G., Pan, J., & Roberts, M. E. (2013). How Censorship in China Allows Government Criticism but Silences Collective Expression. American Political Science Review, 107(2), 326–343. https://doi.org/10.1017/S0003055413000014
*   Getting Started with Sentiment Analysis using Python. (n.d.). Retrieved July 17, 2022, from https://huggingface.co/blog/sentiment-analysis-python

**Tools:**
*   [VADER sentiment analysis (vaderSentiment)](https://github.com/cjhutto/vaderSentiment)
*   [Getting started with sentiment analysis](https://huggingface.co/blog/sentiment-analysis-python)