***
# Assignment 2 Social Graphs 2023

**Link to assignment description:** https://github.com/SocialComplexityLab/socialgraphs2023/blob/main/assignments/Assignment2.ipynb

This assignment applies the following formatting:

* Original assignment text is unformatted and tasks are in bullet point format.

> **✅ Solution**: A solution is marked with a checkmark emoji.

> **📊 Graph**: A caption for a graph has a graph emoji.

> **💬 Comment**: An additional comment uses a speech bubble emoji.
***

## Table of Contents
* [0. Building the network](#network)
* [1. Network visualization and basic stats](#network_stats)
    * [1.a Stats](#stats)
    * [1.b Visualization](#viz)
* [2. Word-clouds](#wordclouds)
* [3. Communities](#communities)
* [4. Sentiment of communities](#sentiment)
* [5. References](#references)

## Requirements <a class="anchor" id="requirements"></a>

from collections import Counter
from nltk import FreqDist
from nltk.corpus import stopwords, PlaintextCorpusReader
from nltk.stem import WordNetLemmatizer
from nltk.text import Text
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

import math
import matplotlib.pyplot as plt
import networkx as nx
import nltk
import os
import pandas as pd
import pickle as pkl
import re
import string

## 0. Building the network <a class="anchor" id="network"></a>
To create our network, we downloaded the rapper Wiki pages from each coast (during Week 4) and linked them via the hyperlinks connecting pages to each other. To achieve this goal we have used regular expressions.

- Explain the strategy you have used to extract the hyperlinks from the Wiki-pages, assuming that you have already collected the rapper pages with the Wikipedia API.

- Show the regular expressions you have built and explain in detail how they work.
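
As an illustration of the kind of strategy asked for here, the following is a minimal sketch and not the solution used in this assignment. It assumes the pages were saved as raw wikitext, where internal links look like `[[Target]]`, `[[Target|label]]` or `[[Target#Section|label]]`. The names `LINK_RE`, `extract_links`, and the sample text are hypothetical.

```python
import re

# \[\[              opening brackets of a wiki link
# ([^\[\]|#]+)      group 1: the link target, up to a '|', '#' or bracket
# (?:#[^\[\]|]*)?   optional section anchor, e.g. '#Death' (not captured)
# (?:\|[^\[\]]*)?   optional display label after a pipe (not captured)
# \]\]              closing brackets
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext, rappers):
    """Return the link targets in `wikitext` that are known rapper pages."""
    targets = {m.group(1).strip() for m in LINK_RE.finditer(wikitext)}
    return targets & set(rappers)


# Made-up sample text, only to show the three link shapes.
sample = (
    "He worked with [[Dr. Dre]] and [[Snoop Dogg|Snoop]]. "
    "See [[Tupac Shakur#Death|his death]] and [[Los Angeles]]."
)
rappers = {"Dr. Dre", "Snoop Dogg", "Tupac Shakur"}
links = extract_links(sample, rappers)
print(sorted(links))
```

Filtering against the set of rapper names discards links to non-rapper pages (cities, files, categories), so each remaining target can become a directed edge from the current page.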

## 1. Network visualization and basic stats <a class="anchor" id="network_stats"></a>
Visualize your network of rappers (from lecture 5) and calculate stats (from lectures 4 and 5). For this exercise, we assume that you have already generated the network and extracted the largest weakly connected component. (The "largest weakly connected component" of a directed network is the subgraph consisting of the nodes that would constitute the largest connected component if the network were undirected.) The visualization and statistics should be done for the largest weakly connected component only.
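
For reference, one way to extract the largest weakly connected component with NetworkX is sketched below. The toy graph and the helper name `largest_weakly_connected_component` are assumptions made for illustration.

```python
import networkx as nx


def largest_weakly_connected_component(G):
    """Return a copy of the largest weakly connected component of a DiGraph."""
    # weakly_connected_components yields node sets, ignoring edge direction.
    largest = max(nx.weakly_connected_components(G), key=len)
    return G.subgraph(largest).copy()


# Toy directed graph: {A, B, C} is the largest component, F is isolated.
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("C", "B"), ("D", "E")])
G.add_node("F")

lwcc = largest_weakly_connected_component(G)
print(lwcc.number_of_nodes(), lwcc.number_of_edges())
```

Calling `.copy()` gives an independent graph instead of a read-only view, which is convenient if attributes are added later.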

## Exercise 1a: Stats <a class="anchor" id="stats"></a>
- What is the number of nodes in the network?

- What is the number of links?

- Who is the top connected rapper? (Report results for the in-degrees and out-degrees). Comment on your findings. Is this what you would have expected?

- Who are the top 5 most connected east-coast rappers (again in terms of in/out-degree)? 

- Who are the top 5 most connected west-coast rappers (again in terms of in/out-degree)?

- Plot the in- and out-degree distributions for the whole network. 
   - Use axes that make sense for visualizing this particular distribution.
   - What do you observe? 
   - Give a pedagogical explanation of why the in-degree distribution is different from the out-degree distribution.

- Find the exponent (by using the `powerlaw` package) for the in- and out-degree distributions. What does it say about our network?

- Compare the two degree distributions to the degree distribution of a *random network* (undirected) with the same number of nodes and probability of connection *p*. Comment on your results.
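
The ranking and random-network tasks above could be approached along the following lines. This is a sketch under assumptions: each node is taken to carry a hypothetical `coast` attribute, and *p* is estimated from the undirected version of the graph (reciprocal links collapsed into one), which is only one possible choice. The helper names `top_k` and `matching_random_graph` are invented here.

```python
import networkx as nx


def top_k(degree_view, k=5):
    """Return the k (node, degree) pairs with the highest degree."""
    return sorted(degree_view, key=lambda pair: pair[1], reverse=True)[:k]


def matching_random_graph(G, seed=0):
    """Build an undirected random graph with the same N and link density as G."""
    n = G.number_of_nodes()
    links = G.to_undirected().number_of_edges()
    p = 2 * links / (n * (n - 1))  # links divided by the number of node pairs
    return nx.gnp_random_graph(n, p, seed=seed), p


# Toy directed graph with a hypothetical 'coast' node attribute.
G = nx.DiGraph()
G.add_nodes_from(["A", "B"], coast="west")
G.add_nodes_from(["C", "D"], coast="east")
G.add_edges_from([("A", "B"), ("C", "B"), ("D", "B"), ("B", "A"), ("A", "C")])

print("nodes:", G.number_of_nodes(), "links:", G.number_of_edges())
print("top in-degree:", top_k(G.in_degree(), k=1))
print("top out-degree:", top_k(G.out_degree(), k=1))

# Restrict the ranking to one coast by passing its nodes to the degree view.
east = [n for n, coast in G.nodes(data="coast") if coast == "east"]
print("top east in-degree:", top_k(G.in_degree(east), k=1))

R, p = matching_random_graph(G)
```

The degree sequences, for example `[d for _, d in G.in_degree()]`, can then be plotted on logarithmic axes or passed to `powerlaw.Fit(..., discrete=True)`, whose `alpha` attribute holds the fitted exponent.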

## Exercise 1b: Visualization <a class="anchor" id="viz"></a>
- Create a nice visualization of the total (directed) network:
   - Color nodes according to the role;
   - Scale node-size according to degree;
   - Get node positions based on either the Force Atlas 2 algorithm, or the built-in algorithms for networkX;
   - Add whatever else you feel would make the visualization nicer.

- Describe the structure you observe. What useful information can you decipher from this?
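
A possible starting point for such a visualization is sketched below. It assumes that "role" means the rapper's coast, stored in a hypothetical `coast` node attribute, and it uses the built-in spring layout of NetworkX. The colour mapping, the size scaling, and the helper names are arbitrary choices for illustration.

```python
import networkx as nx

# Hypothetical colour per coast; nodes without a coast fall back to gray.
COAST_COLORS = {"west": "tab:orange", "east": "tab:blue"}


def node_style(G, base_size=20, scale=15):
    """Return per-node colours (by coast) and sizes (by total degree)."""
    colors = [COAST_COLORS.get(G.nodes[n].get("coast"), "lightgray") for n in G]
    sizes = [base_size + scale * G.degree(n) for n in G]
    return colors, sizes


def draw_network(G, seed=0):
    """Draw G with a spring layout and return the matplotlib figure."""
    import matplotlib.pyplot as plt

    colors, sizes = node_style(G)
    pos = nx.spring_layout(G, seed=seed)
    fig, ax = plt.subplots(figsize=(10, 10))
    nx.draw_networkx_nodes(G, pos, node_color=colors, node_size=sizes, ax=ax)
    nx.draw_networkx_edges(G, pos, alpha=0.2, arrows=False, ax=ax)
    ax.set_axis_off()
    return fig


# Toy graph: A (west), B (east), C (no coast recorded).
G = nx.DiGraph()
G.add_node("A", coast="west")
G.add_node("B", coast="east")
G.add_node("C")
G.add_edges_from([("A", "B"), ("B", "A"), ("C", "A")])

colors, sizes = node_style(G)
print(colors, sizes)
```

Calling `draw_network(G)` in a notebook cell renders the figure. Edges are drawn without arrowheads here to keep a dense network readable.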

## 2. Word-clouds <a class="anchor" id="wordclouds"></a>

Create your own version of the word-clouds (from lecture 7). For this exercise we assume you know how to download and clean text from rappers' Wikipedia pages.

Here's what you need to do:
- Create a word-cloud for each coast according to the novel TF-TR method. Feel free to make it as fancy as you like. Explain your process and comment on your results.

- For each coast, what are the 5 words with the highest TR scores? Comment on your result.
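
The assignment text does not define TF-TR, so the sketch below rests on an assumption: the term ratio (TR) of a word for one coast is taken to be its term frequency on that coast divided by its term frequency on the other coast plus a constant `c`, and TF-TR is the product of TF and TR. The function names and token lists are made up.

```python
from collections import Counter


def term_ratio(own_tokens, other_tokens, c=1.0):
    """TR per word: TF on this coast / (TF on the other coast + c)."""
    tf_own = Counter(own_tokens)
    tf_other = Counter(other_tokens)
    # A Counter returns 0 for unseen words, and c avoids division by zero.
    return {word: tf / (tf_other[word] + c) for word, tf in tf_own.items()}


def tf_tr(own_tokens, other_tokens, c=1.0):
    """TF-TR per word: term frequency multiplied by the term ratio."""
    tf_own = Counter(own_tokens)
    tr = term_ratio(own_tokens, other_tokens, c)
    return {word: tf_own[word] * tr[word] for word in tf_own}


# Made-up token lists standing in for the cleaned text of each coast.
west_tokens = ["compton", "compton", "album", "rap"]
east_tokens = ["brooklyn", "album", "album", "rap"]

west_scores = tf_tr(west_tokens, east_tokens)
east_scores = tf_tr(east_tokens, west_tokens)
print(max(west_scores, key=west_scores.get), max(east_scores, key=east_scores.get))
```

A dictionary such as `west_scores` can be passed to `WordCloud().generate_from_frequencies(...)` to draw the cloud, and sorting the output of `term_ratio` gives the words with the highest TR scores.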

## 3. Communities <a class="anchor" id="communities"></a>
Find communities and their modularity. Here's what you need to do:
- In your own words, explain what the measure "modularity" is, and the intuition behind the formula you use to compute it. 

- Find communities in the network

- Explain how you chose to identify the communities: Which algorithm did you use and how does it work?

- Comment on your results:
    - How many communities did you find in total?

- Compute the value of modularity with the partition created by the algorithm.

- Plot and/or print the distribution of community sizes (whichever makes most sense). Comment on your result.
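
As one possible approach, the sketch below runs the Louvain algorithm as implemented in NetworkX (version 2.8 or later) on a toy graph. Choosing Louvain is an assumption made for illustration, and for the rapper network the directed graph could first be converted with `G.to_undirected()`.

```python
from collections import Counter

import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy graph: two 5-node cliques joined by a single link.
G = nx.barbell_graph(5, 0)

# Louvain greedily moves nodes between communities to increase modularity,
# then merges each community into a super-node and repeats.
communities = louvain_communities(G, seed=0)
Q = modularity(G, communities)

# Distribution of community sizes: {size: number of communities of that size}.
size_distribution = Counter(len(c) for c in communities)

print(len(communities), round(Q, 3), dict(size_distribution))
```

Fixing `seed` makes the partition reproducible, since Louvain visits nodes in random order.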


Now, partition your rappers into two communities based on which coast they represent.
  - What is the modularity of this partition? Comment on the result.
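
To make the quantity concrete, here is a minimal sketch of modularity for an undirected network, using the common form `Q = sum over communities c of [ L_c / L - (k_c / (2L))^2 ]`, where `L` is the total number of links, `L_c` the number of links inside community `c`, and `k_c` the total degree of its nodes. The edge list and the `coast_of` mapping are toy data, and the function name is hypothetical.

```python
from collections import defaultdict


def partition_modularity(edges, community_of):
    """Modularity of a partition of an undirected graph given as an edge list."""
    total_links = len(edges)
    internal_links = defaultdict(int)  # L_c: links with both ends in c
    degree_sum = defaultdict(int)      # k_c: total degree of nodes in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal_links[cu] += 1
    return sum(
        internal_links[c] / total_links - (degree_sum[c] / (2 * total_links)) ** 2
        for c in degree_sum
    )


# Toy network: two triangles joined by one link, one triangle per coast.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
coast_of = {"a": "west", "b": "west", "c": "west",
            "d": "east", "e": "east", "f": "east"}

Q_coast = partition_modularity(edges, coast_of)
print(round(Q_coast, 4))
```

The first term rewards links that fall inside a community, and the second subtracts the share expected if links were placed at random while keeping node degrees fixed. Putting every node into a single community therefore gives a modularity of 0.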

## 4. Sentiment of communities <a class="anchor" id="sentiment"></a>

Analyze the sentiment of communities.

Here's what you need to do (use the LabMT wordlist approach):

- Calculate and store sentiment for every rapper

- Create a histogram of all rappers' associated sentiments.

- Who are the 10 rappers with the happiest and the saddest pages?
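
A wordlist-based page sentiment can be sketched as below: the sentiment of a page is the average happiness score of those tokens that appear in the wordlist. The scores in `toy_happiness` are invented for illustration and are not taken from LabMT. Returning 0.0 for pages with no matching words is an assumption, chosen so that such pages can later be disregarded.

```python
def page_sentiment(tokens, happiness):
    """Average happiness of the tokens found in the wordlist (0.0 if none)."""
    scores = [happiness[token] for token in tokens if token in happiness]
    return sum(scores) / len(scores) if scores else 0.0


# Invented scores on a 1-9 scale, standing in for the real wordlist.
toy_happiness = {"love": 8.0, "party": 7.0, "prison": 2.0}

# Hypothetical tokenized pages, keyed by rapper name.
pages = {
    "Rapper A": ["love", "party", "prison", "album"],
    "Rapper B": ["album", "record"],
}

sentiments = {name: page_sentiment(tokens, toy_happiness)
              for name, tokens in pages.items()}
ranked = sorted(sentiments, key=sentiments.get, reverse=True)
print(sentiments, ranked)
```

Because repeated tokens are kept in the list, frequent words weigh more in the average. Slicing `ranked` from either end gives the happiest and the saddest pages.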

Now, compute the sentiment of each coast: 
- Which is the happiest and which is the saddest coast according to the LabMT wordlist approach? (Take the coast's sentiment to be the average sentiment of the coast's rappers' pages, disregarding any rappers with sentiment 0.)

- Use the "label shuffling test" to test if the coast with the highest Wikipedia page sentiment has a page sentiment that is significantly higher (at the 5% significance level) than a randomly selected group of rappers of the same size.

- Does the result make sense to you? Elaborate.
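
The label shuffling test can be sketched as follows: draw many random groups of the same size as the coast, compute each group's average sentiment, and report the fraction of random groups whose average is at least as high as the observed one. The function name, the number of shuffles, and the sentiment values are assumptions made for illustration.

```python
import random
from statistics import mean


def label_shuffle_p_value(group_scores, all_scores, n_shuffles=1000, seed=0):
    """Fraction of random same-size groups with a mean >= the observed mean."""
    rng = random.Random(seed)
    observed = mean(group_scores)
    size = len(group_scores)
    hits = sum(
        mean(rng.sample(all_scores, size)) >= observed
        for _ in range(n_shuffles)
    )
    return hits / n_shuffles


# Made-up sentiments with zeros already removed: 10 "happy" pages, 40 others.
happy_coast = [9.0] * 10
other_coast = [1.0] * 40
all_rappers = happy_coast + other_coast

p_value = label_shuffle_p_value(happy_coast, all_rappers)
print(p_value, p_value < 0.05)
```

A value below 0.05 would mean that fewer than 5% of random groups reach the coast's average, so the difference would be called significant at that level.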

A couple of additional instructions you will need below:
- Average the average sentiment of the nodes in each community to find a community-level sentiment.

## 5. References <a class="anchor" id="references"></a>

[1] Dodds, Peter Sheridan, et al. "Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter." PLoS ONE 6.12 (2011): e26752. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752

[2] Longhurst, J. W. S., D. Rayfield, and D. E. Conlan. 1994. "The Impacts Of Road Transport On Urban Air Quality-A Case Study Of The Greater Manchester Region." WIT Transactions on Ecology and the Environment 3.
