# Fundamentals of Social Data Science 
## Week 2 Day 2 Lab: Downloading from Wikipedia

Today we will review some changes to the Wikipedia code. These changes considerably change what you can do with this code. The end result will be a set of two folders, `data` and `DataFrames`, which you can use for analysis of Wikipedia.

I have altered the code in several ways:
- It uses (and reports) curl on Special:Export to get the complete history of a page.
- Reporting and commenting are considerably expanded.
- New arguments are available to the script, including `--count_only`.
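
For readers who have not declared a command-line flag before, here is a minimal sketch of how an option like `--count_only` could be defined with `argparse`. This is not taken from `download_wiki_revisions.py`: the positional `article` argument, the help text, and the assumed meaning of the flag (count revisions without downloading them) are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser; the real script may define its arguments differently."""
    parser = argparse.ArgumentParser(
        description="Sketch of a Wikipedia revision downloader interface."
    )
    # Assumed positional argument: the article title.
    parser.add_argument("article", help="Title of the Wikipedia article")
    # store_true makes the flag False by default and True when it is present.
    parser.add_argument(
        "--count_only",
        action="store_true",
        help="(assumed meaning) only count revisions, do not download them",
    )
    return parser


# Parse an explicit list instead of sys.argv so the sketch runs inside a notebook.
args = build_parser().parse_args(["Taylor Swift", "--count_only"])
print(args.article, args.count_only)
```

Run the real script with `--help` to see the arguments it actually accepts.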

There is also now a second script, `xml_to_dataframe.py`, which processes these files and turns them into separate DataFrames. These DataFrames are stored as `.feather` files and can be loaded with the code below.

You should review `xml_to_dataframe.py`. All the operations in that file have been covered in class except `tqdm`, and you can see how that works in practice.
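
Since `tqdm` has not been covered in class, here is a minimal sketch of its most common use: wrapping an iterable so that a progress bar is drawn while a loop runs. The function name `process_all` and the squaring step are placeholders, not code from `xml_to_dataframe.py`. The `try`/`except` fallback is only there so the sketch still runs where `tqdm` is not installed.

```python
try:
    from tqdm import tqdm  # third-party package: pip install tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm; no progress bar is shown.
    def tqdm(iterable, **kwargs):
        return iterable


def process_all(items):
    """Square each item, showing a progress bar while iterating."""
    results = []
    # desc labels the bar; unit names what is being counted.
    for item in tqdm(items, desc="Processing", unit="item"):
        results.append(item * item)
    return results


squares = process_all(range(5))
print(squares)
```

The loop body is unchanged by the wrapper; `tqdm` only observes how many items have been consumed.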

You will note that this version does not use recursion to count the files. Instead it looks literally within the year and month folders. This is sufficient for this work, but it would not be robust with a deeper folder structure or one whose structure is less certain. On the other hand, assuming year and month allows some interesting statistics about each year and month to be displayed. In your own work, consider whether to approach a task with a more general but often more abstract solution or a more specific but often more fragile one. You can see in Jon's solution (`download_and_count_revisions_solution.py`) that he used a clever way to simply count all the files, using a global and letting the global handle the recursion.
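
The trade-off can be made concrete with a small sketch. It assumes a layout of `<root>/<year>/<month>/*.xml`, which is a guess for illustration and may not match the real `data` folder. The recursive version uses `Path.rglob`, which is one general approach and not necessarily the one used in either script. Both function names are hypothetical.

```python
import tempfile
from pathlib import Path


def count_by_year_month(root: Path) -> dict[tuple[str, str], int]:
    """Specific approach: assumes exactly root/<year>/<month>/<files>."""
    counts = {}
    for year_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for month_dir in sorted(p for p in year_dir.iterdir() if p.is_dir()):
            n_files = sum(1 for f in month_dir.iterdir() if f.suffix == ".xml")
            counts[(year_dir.name, month_dir.name)] = n_files
    return counts


def count_recursively(root: Path) -> int:
    """General approach: counts .xml files at any depth, but loses the breakdown."""
    return sum(1 for _ in root.rglob("*.xml"))


# Build a small throwaway folder tree to try both functions on.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for year, month, n in [("2013", "07", 3), ("2019", "08", 2)]:
        month_dir = root / year / month
        month_dir.mkdir(parents=True)
        for i in range(n):
            (month_dir / f"{i}.xml").write_text("<revision/>")
    per_month = count_by_year_month(root)
    total = count_recursively(root)

print(per_month)
print(total)
```

The specific version gives per-month counts for free but silently misses files stored at any other depth; the recursive version finds every file but reports only a total.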

You should now be able to download a complete history for a single Wikipedia page and process it as a DataFrame. Confirm that you can do this with the code yourself. Then discuss among your group:
1. Which two (or more) public figures are worth comparing and why. 
2. Prior to any specific time series analysis, consider your expectations for this exploratory comparison.  

Draw upon your group's potential expertise in social science to come up with a theoretically informed rationale for a given comparison. 

## Merging in Changes to a Repository 

First you will want to merge files from an upstream repository (mine). These instructions show how to do that from the terminal. Run these commands from inside the `oii-fsds-wikipedia` folder. Note especially **Step 3**: the merge will overwrite `download_wiki_revisions.py`, so make a backup of your own version first.

1. **Add the original repository as a remote:**
   ```sh
   git remote add upstream https://github.com/berniehogan/oii-fsds-wikipedia.git
   ```

2. **Fetch the changes from the original repository:**
   ```sh
   git fetch upstream
   ```

3. **Back up any local changes:**
   If you have your own versions of files like `download_wiki_revisions.py`, you should rename the file first to avoid conflicts:
   ```sh
   mv download_wiki_revisions.py download_wiki_revisions_backup.py
   ```

4. **Merge upstream changes into your local main branch:**
   ```sh
   git merge upstream/main
   ```

5. **Resolve any conflicts and commit the changes:**
   You should resolve any conflicts that arise during the merge and then commit the changes:
   ```sh
   git add .
   git commit -m "Merge changes from upstream"
   ```

6. **Push the changes to your GitHub repository:**
   ```sh
   git push origin main
   ```

7. **Test your code after merging:**
   You should test your code to ensure everything works correctly after the integration.

By following these steps, you should be able to integrate the latest changes from my repository while preserving your own custom modifications.

Once this is done, you can, if you wish, use the script below to run the commands directly within a Jupyter notebook rather than via the terminal.

In [1]:
import os
import pandas as pd

# Define articles we want to download
article1 = "BTS"
article2 = "Taylor Swift"

# Create necessary directories if they don't exist
os.makedirs("data", exist_ok=True)
os.makedirs("DataFrames", exist_ok=True)

# Download revisions for both articles
print("Downloading revisions for first article...")
os.system(f'python download_wiki_revisions.py "{article1}"')
print("\nDownloading revisions for second article...")
os.system(f'python download_wiki_revisions.py "{article2}"')

# Convert all downloaded revisions to DataFrames
print("\nConverting revisions to DataFrames...")
os.system('python xml_to_dataframe.py --data-dir ./data --output-dir ./DataFrames')

# Load and verify one of the DataFrames
print("\nVerifying DataFrame contents...")
df = pd.read_feather(f"DataFrames/{article1}.feather")

# Display basic information about the DataFrame
print("\nDataFrame Info:")
df.info()  # info() prints its own report and returns None, so no print() is needed

print("\nFirst few rows:")
print(df.head())

# Display some basic statistics
print("\nBasic statistics:")
print(f"Total number of revisions: {len(df)}")
print(f"Date range: from {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"Number of unique editors: {df['username'].nunique()}")

Downloading revisions for first article...


Traceback (most recent call last):
  File "/Users/priyansha/Documents/Oxford/Term1/SDS_in_python/Week02/oii-fsds-wikipedia/download_wiki_revisions.py", line 5, in <module>
    from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'



Downloading revisions for second article...


Traceback (most recent call last):
  File "/Users/priyansha/Documents/Oxford/Term1/SDS_in_python/Week02/oii-fsds-wikipedia/download_wiki_revisions.py", line 5, in <module>
    from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'



Converting revisions to DataFrames...


Traceback (most recent call last):
  File "/Users/priyansha/Documents/Oxford/Term1/SDS_in_python/Week02/oii-fsds-wikipedia/xml_to_dataframe.py", line 4, in <module>
    from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'



Verifying DataFrame contents...


FileNotFoundError: [Errno 2] No such file or directory: 'DataFrames/BTS.feather'
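
The tracebacks above show that the Python which ran the scripts could not import `bs4` (provided by the `beautifulsoup4` package), and the final `FileNotFoundError` is a consequence: no `.feather` file was ever written. One possible cause is that `python` in `os.system` resolves to a different interpreter than the notebook kernel. The sketch below, which is a suggestion and not part of the repository, checks for the module first and runs scripts with the kernel's own interpreter via `sys.executable`. The helper names are hypothetical.

```python
import importlib.util
import subprocess
import sys


def missing_modules(names):
    """Return the modules in `names` that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def build_command(script, *args):
    """Build a command that runs `script` with the same Python as this notebook."""
    return [sys.executable, script, *args]


def run_script(script, *args):
    """Run a script; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(script, *args), check=True)


missing = missing_modules(["bs4"])
if missing:
    print(f"Install these before running the scripts: {missing}")
else:
    print("bs4 is available")

# Usage (not executed here, since it needs the repository's scripts):
# run_script("download_wiki_revisions.py", "BTS")
```

Passing the arguments as a list also avoids quoting problems with titles that contain spaces, such as "Taylor Swift". With `check=True` the cell stops at the first failing script instead of continuing to a confusing error later on.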

In [2]:
import pandas as pd

In [4]:
BTS_df = pd.read_feather("DataFrames/BTS.feather")

In [5]:
TaylorSwift_df = pd.read_feather("DataFrames/Taylor Swift.feather")

In [10]:
display(BTS_df)

| | revision_id | timestamp | username | userid | comment | text_length | year | month |
|---|---|---|---|---|---|---|---|---|
| 3592 | 909353379 | 2019-08-04 21:33:10+00:00 | Hahahey568 | 23746076 | | 199129 | 2019 | 08 |
| 3588 | 909353028 | 2019-08-04 21:30:29+00:00 | Hahahey568 | 23746076 | /* Impact and influence */ princes of pop | 198810 | 2019 | 08 |
| 3589 | 909167621 | 2019-08-03 15:55:31+00:00 | Hahahey568 | 23746076 | | 198073 | 2019 | 08 |
| 3593 | 909167138 | 2019-08-03 15:51:04+00:00 | Hahahey568 | 23746076 | | 198027 | 2019 | 08 |
| 3591 | 909157666 | 2019-08-03 14:24:51+00:00 | Hahahey568 | 23746076 | /* Impact and influence */ UN SDG | 197960 | 2019 | 08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48 | 562879054 | 2013-07-04 20:19:54+00:00 | | | | 7146 | 2013 | 07 |
| 77 | 562876787 | 2013-07-04 19:59:53+00:00 | Hinorisakamachi | 15926446 | /* Music videos */ | 7270 | 2013 | 07 |
| 95 | 562876719 | 2013-07-04 19:59:17+00:00 | Hinorisakamachi | 15926446 | /* Videography */ | 7270 | 2013 | 07 |
| 29 | 562875379 | 2013-07-04 19:47:39+00:00 | Hinorisakamachi | 15926446 | /* Singles */ | 7670 | 2013 | 07 |


In [11]:
display(TaylorSwift_df)

| | revision_id | timestamp | username | userid | comment | text_length | year | month |
|---|---|---|---|---|---|---|---|---|
| 727 | 494850573 | 2012-05-28 22:14:55+00:00 | Popeye191 | 6138283 | | 154848 | 2012 | 05 |
| 995 | 494849454 | 2012-05-28 22:06:14+00:00 | Popeye191 | 6138283 | | 154836 | 2012 | 05 |
| 850 | 494848715 | 2012-05-28 22:00:58+00:00 | Popeye191 | 6138283 | | 154842 | 2012 | 05 |
| 673 | 494848291 | 2012-05-28 21:58:02+00:00 | Popeye191 | 6138283 | added to intro | 154825 | 2012 | 05 |
| 604 | 494846912 | 2012-05-28 21:49:01+00:00 | Popeye191 | 6138283 | /* 2008–10: Fearless release, MTV VMA incident... | 154677 | 2012 | 05 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1752 | 57758435 | 2006-06-09 20:06:38+00:00 | Db3811 | 1558618 | | 862 | 2006 | 06 |
| 1755 | 57634230 | 2006-06-09 02:17:16+00:00 | TantalumTelluride | 498945 | cats | 867 | 2006 | 06 |
| 1750 | 57630878 | 2006-06-09 01:47:45+00:00 | TantalumTelluride | 498945 | wikify | 631 | 2006 | 06 |
| 1754 | 56860230 | 2006-06-04 18:26:58+00:00 | Db3811 | 1558618 | | 516 | 2006 | 06 |


In [38]:
def quick_wikipedia_stats(df: pd.DataFrame) -> None:
    """Print summary statistics for a DataFrame of Wikipedia revisions."""
    n_revisions = len(df)
    n_editors = df["username"].nunique()  # null usernames are not counted
    edits_per_editor = df["username"].value_counts()
    null_comments = df["comment"].isnull().sum()

    def count_and_pct(column: str, word: str) -> str:
        # Case-insensitive substring match; null values count as non-matches.
        count = df[column].str.contains(word, case=False, na=False).sum()
        return f"{count}, {count / n_revisions * 100:.2f}%"

    print(f"Total number of revisions: {n_revisions}")
    print(f"Date range: from {df['timestamp'].min()} to {df['timestamp'].max()}")
    print(f"Number of unique editors: {n_editors}")
    print(f"Average number of revisions per editor: {n_revisions / n_editors:.2f}")
    print(f"Minimum and Max number of revisions by a single editor: {edits_per_editor.min()}, {edits_per_editor.max()}")
    print(f"Number of Nullvalues in comments: {null_comments}")
    print("")
    print(f"Percentage of null Values in comments: {null_comments / n_revisions * 100:.2f}%")
    print(f"Number and percentage of *bot* in username: {count_and_pct('username', 'bot')}")
    print(f"Number and percentage of *bot* in comments: {count_and_pct('comment', 'bot')}")
    print(f"Number and percentage of *vandal* in comments: {count_and_pct('comment', 'vandal')}")

In [39]:
for df in [BTS_df, TaylorSwift_df]:
    print("\n")
    quick_wikipedia_stats(df)



Total number of revisions: 5307
Date range: from 2013-07-04 19:45:15+00:00 to 2019-08-04 21:33:10+00:00
Number of unique editors: 767
Average number of revisions per editor: 6.92
Minimum and Max number of revisions by a single editor: 1, 229
Number of Nullvalues in comments: 1008

Percentage of null Values in comments: 18.99%
Number and percentage of *bot* in username: 201, 3.79%
Number and percentage of *bot* in comments: 176, 3.32%
Number and percentage of *vandal* in comments: 141, 2.66%


Total number of revisions: 5224
Date range: from 2006-06-04 18:26:39+00:00 to 2012-05-28 22:14:55+00:00
Number of unique editors: 1040
Average number of revisions per editor: 5.02
Minimum and Max number of revisions by a single editor: 1, 733
Number of Nullvalues in comments: 1051

Percentage of null Values in comments: 20.12%
Number and percentage of *bot* in username: 129, 2.47%
Number and percentage of *bot* in comments: 102, 1.95%
Number and percentage of *vandal* in comments: 91, 1.74%
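
One caveat when reading the *bot* figures: a plain substring match on "bot" also flags usernames such as "Abbott", so the counts above are an upper bound rather than a count of actual bots. The sketch below tries a slightly stricter rule, requiring "bot" to end a word. The rule itself is an assumption for illustration, not an official Wikipedia definition, and it still produces false positives (a surname like "Talbot"). The usernames are an illustrative sample, not drawn from the DataFrames above.

```python
import re

# "bot" followed by a word boundary, case-insensitive.
BOT_PATTERN = re.compile(r"bot\b", flags=re.IGNORECASE)


def contains_bot(username: str) -> bool:
    """Loose rule: 'bot' appears anywhere in the name."""
    return "bot" in username.lower()


def looks_like_bot(username: str) -> bool:
    """Stricter rule: 'bot' appears at the end of a word in the name."""
    return BOT_PATTERN.search(username) is not None


sample = ["ClueBot NG", "AnomieBOT", "Abbott", "Robotnik42", "Talbot"]
loose = [name for name in sample if contains_bot(name)]
strict = [name for name in sample if looks_like_bot(name)]

print(loose)
print(strict)
```

The same pattern string can be passed to pandas, for example `df["username"].str.contains(r"bot\b", case=False, na=False)`. For a substantive analysis it would be more reliable to check accounts against Wikipedia's own bot flag than to rely on names.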

