# WikiPage fetcher and cleaner notebook

_This notebook fetches and stores WikiPages: it navigates the Harry Potter Fandom wiki, downloads the pages, and handles redirects._

**Some imports**:

In [1]:
import urllib.request as urllib2
import urllib
import re
import json
import pandas as pd
import numpy as np
import os
import mwparserfromhell
import pickle

In [2]:
# Defining home directory
hd = os.getcwd()
print(hd)

/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis


---

# Setup

_Defining the books, and fetching the WikiPages first appearing in these books._


In [3]:
# Build one list of character wikilinks and a parallel list with the book in which each character first appeared
wikilinks = []
book_nrs = []

books = ['Harry_Potter_and_the_Philosopher%27s_Stone_(character_index)', 
'Harry_Potter_and_the_Chamber_of_Secrets_(character_index)', 
'Harry_Potter_and_the_Prisoner_of_Azkaban_(character_index)',
'Harry_Potter_and_the_Goblet_of_Fire_(character_index)',
'Harry_Potter_and_the_Order_of_the_Phoenix_(character_index)',
'Harry_Potter_and_the_Half-Blood_Prince_(character_index)',
'Harry_Potter_and_the_Deathly_Hallows_(character_index)']

for book in books:

    # Get characters from each book
    baseurl = 'https://harrypotter.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=main&format=json&utf8=1&titles='
    title = book
    wikiresponse_list = urllib2.urlopen(baseurl+title) # generate the complete query url
    wikihtml_list = wikiresponse_list.read().decode("utf-8") # Decode the results
    wikijson = json.loads(wikihtml_list) # Load the json 

    # Fetching the book text from json
    text = wikijson['query']['pages'][next(iter(wikijson['query']['pages']))]['revisions'][0]['slots']['main']['*']

    # Extracting all the links from the text
    links = re.findall(r"\[\[(.*?)\]\]", text)
    
    # Cleaning the links: skip files, categories, interlanguage links and book titles, and drop display labels
    links_clean = [l.split('|')[0] for l in links if not l.startswith(('File', 'nl:', 'Category:', 'pl:', 'ru:', 'Harry Potter and the '))]
    
    # Determining book number
    book_nr = [books.index(book)+1 for i in range(len(set(links_clean)))]
    
    # Adding indication which book the character first appeared in
    wikilinks.extend(set(links_clean))
    book_nrs.extend(book_nr)
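
The link extraction used in the loop above can be tried offline on a small string. The sketch below is a minimal illustration, not the notebook's exact code: the helper name `extract_link_targets` and the sample wikitext are made up for the example, and unlike the `set` used above it keeps the order of first appearance.

```python
import re

# Prefixes the notebook treats as non-character links
SKIPPED_PREFIXES = ('File', 'nl:', 'Category:', 'pl:', 'ru:', 'Harry Potter and the ')

def extract_link_targets(wikitext):
    """Return the unique link targets in wikitext, in order of first appearance."""
    targets = []
    for link in re.findall(r"\[\[(.*?)\]\]", wikitext):
        if link.startswith(SKIPPED_PREFIXES):
            continue
        target = link.split('|')[0]  # drop the display label after the pipe
        if target not in targets:
            targets.append(target)
    return targets

# Hypothetical sample text, written by hand for this example
sample = "[[Harry Potter|Harry]] met [[Tufty]] and [[File:Cat.png]]; [[Harry Potter]] left. [[Category:Pets]]"
print(extract_link_targets(sample))
```

Keeping the order makes the resulting dataframe reproducible between runs, which a plain `set` does not guarantee for strings.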


In [4]:
# Creating a dataframe for wikis and their respective book numbers

df_wiki = pd.DataFrame(list(zip(wikilinks,book_nrs)), columns=['wiki','book'])
len(df_wiki)

810

In [5]:
df_wiki.head(3)

Unnamed: 0,wiki,book
0,Tufty,1
1,Mirror of Erised,1
2,Quirinus Quirrell's first mountain troll,1


In [6]:
# We have a few duplicates that we need to drop. 

# We like to keep the first instance where the characters appeared
df_wiki = df_wiki.drop_duplicates(subset=['wiki'], keep='first')

len(df_wiki)

789
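
As a small illustration of why `keep='first'` preserves the earliest book, here is a sketch on toy data. The rows and book numbers are invented for the example and assume, as in the notebook, that rows are ordered by book.

```python
import pandas as pd

# Toy data: 'Tufty' is listed under two books; rows are ordered by book number
toy = pd.DataFrame({'wiki': ['Tufty', 'Fawkes', 'Tufty'], 'book': [1, 2, 4]})

# keep='first' retains the first row per wiki, i.e. the earliest book
deduped = toy.drop_duplicates(subset=['wiki'], keep='first')
print(deduped)
```

The original index labels are kept, so the dropped row leaves a gap in the index rather than renumbering the rows.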

---

# Fetching WikiPages

_Now that the dataframe is set up, we fetch the pages from the Fandom page. We also have to handle redirects and make sure we're storing the correct data. The pages are stored in their respective book's folders (book1 - book7), as text files._
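
The two steps described above, pulling the wikitext out of the API response and detecting a redirect, can be sketched as two small helpers. This is a sketch under assumptions: the function names are hypothetical, the sample responses are hand-written to follow the JSON shape the notebook indexes into, and the page id `'123'` is made up. The redirect check here ignores case, which is slightly more lenient than a plain `startswith('#REDIRECT')`.

```python
import re

def extract_wikitext(wikijson):
    """Return the page wikitext from a query response, or '' if the page is missing."""
    pages = wikijson['query']['pages']
    page = next(iter(pages.values()))
    if 'missing' in page or 'revisions' not in page:
        return ''
    revision = page['revisions'][0]
    if 'slots' in revision:
        return revision['slots']['main']['*']  # slot-based response shape
    return revision['*']                       # older response shape

def redirect_target(wikitext):
    """Return the redirect target title if wikitext is a redirect stub, else None."""
    if not wikitext.upper().startswith('#REDIRECT'):
        return None
    links = re.findall(r"\[\[(.*?)\]\]", wikitext)
    return links[0].split('|')[0] if links else None

# Hand-written sample responses (assumed shapes, not real API output)
redirect_response = {'query': {'pages': {'123': {
    'revisions': [{'slots': {'main': {'*': '#REDIRECT [[Rogue bludger]]'}}}]}}}}
missing_response = {'query': {'pages': {'-1': {'title': 'No such page', 'missing': ''}}}}

print(redirect_target(extract_wikitext(redirect_response)))
print(repr(extract_wikitext(missing_response)))
```

Because both helpers are pure functions, the same code can serve the first request and the follow-up request for the redirect target.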

In [7]:
# Suppress warning message for SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# Now we must loop through all the character wikis and download all of their text into files
folders = ['book1','book2','book3', 'book4', 'book5', 'book6', 'book7']
for folder in folders:
    if not os.path.exists(folder):
        os.mkdir(folder)

# Define the working directory
wd = hd

# Keep information about which links redirect -> 
# That way we can make sure the information is displayed as correctly as possible
df_wiki['alternative_wiki'] = None

for idx, row in df_wiki.iterrows():
    # Check if working directory is correct based on the next row - we want to save each character into their book's folder
    expected_path = wd + '/book' + str(row['book'])
    if os.getcwd() != expected_path:
        print("Setting working directory")
        print(expected_path)
        os.chdir(expected_path)

    # Get info about the character
    baseurl = 'https://harrypotter.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=main&format=json&utf8=1&titles='
    title = urllib.parse.quote(row.wiki)
    file_name = row.wiki

    print("Book: ", row.book, " Character: ", row.wiki)

    wikiresponse = urllib2.urlopen(baseurl+title) # generate the complete query url
    wikihtml = wikiresponse.read().decode("utf-8") # Decode the results
    wikijson = json.loads(wikihtml) # Load the json 

    # Extract the page text from the json; depending on the response, it is located in different places
    page = wikijson['query']['pages'][next(iter(wikijson['query']['pages']))]
    try:
        text = page['revisions'][0]['slots']['main']['*']
    except (KeyError, IndexError):
        try:
            text = page['revisions'][0]['*']
        except (KeyError, IndexError):
            # Missing pages have no revisions; store an empty text so no stale value is reused
            text = ''
   
    # Before we continue we must check if there is a redirect on the text - otherwise we will get no info on the character
    if text.startswith('#REDIRECT'):
        
        link = re.findall(r"\[\[(.*?)\]\]", text)
        new_title = urllib.parse.quote(link[0])
        print("WAS REDIRECTED TO: ", link[0])

        #Redo the steps above with the new_title
        wikiresponse = urllib2.urlopen(baseurl+new_title) # generate the complete query url
        wikihtml = wikiresponse.read().decode("utf-8") # Decode the results
        wikijson = json.loads(wikihtml) # Load the json

        # Extract the text of the redirect target, falling back to an empty text for missing pages
        page = wikijson['query']['pages'][next(iter(wikijson['query']['pages']))]
        try:
            text = page['revisions'][0]['slots']['main']['*']
        except (KeyError, IndexError):
            try:
                text = page['revisions'][0]['*']
            except (KeyError, IndexError):
                text = ''

        # Update the wiki redirects to
        df_wiki.at[idx, 'alternative_wiki'] = link[0]
        file_name = link[0]

    # Create a txt file of the text contained on the wikipage
    with open(file_name + '.txt', "w") as file:
        file.write(text)

# going back to home directory
os.chdir(hd)

Setting working directory
/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis/book1
Book:  1  Character:  Tufty
Book:  1  Character:  Mirror of Erised
Book:  1  Character:  Quirinus Quirrell's first mountain troll
Book:  1  Character:  Owner of the Railview Hotel
Book:  1  Character:  George Weasley
Book:  1  Character:  Ronald Weasley
Book:  1  Character:  Fluffy
Book:  1  Character:  Cliodna
Book:  1  Character:  Arabella Figg
Book:  1  Character:  Gellert Grindelwald
Book:  1  Character:  Quirinus Quirrell
Book:  1  Character:  Peter Pettigrew
Book:  1  Character:  Neville Longbottom
Book:  1  Character:  Vernon Dursley's secretary
Book:  1  Character:  Circe
Book:  1  Character:  Gordon
Book:  1  Character:  Draco Malfoy
Book:  1  Character:  Mandy Brocklehurst
Book:  1  Character:  Violet-cloaked wizard
Book:  1  Character:  Albus Dumbledore
Book:  1  Character:  Piers Polkiss's mother
Book:  1  Character:  Theodore Nott
Book:

Book:  2  Character:  Hermione Granger's mother
Book:  2  Character:  Ghost
Book:  2  Character:  Dobby's Bludger
WAS REDIRECTED TO:  Rogue bludger
Book:  2  Character:  pixie
Book:  2  Character:  Fawkes
Book:  2  Character:  Martin Miggs
Book:  2  Character:  Bozo
Book:  2  Character:  Gladys Gudgeon
Book:  2  Character:  Demetrius Prod
Book:  2  Character:  Nicholas de Mimsy-Porpington's five-hundredth Deathday Party
Book:  2  Character:  Mortlake
Book:  2  Character:  Mason
Book:  2  Character:  Z. Nettles
Book:  2  Character:  Wailing Widow
Book:  2  Character:  Bandon Banshee
Book:  2  Character:  Borgin
Book:  2  Character:  Fire Dwelling Salamander
Book:  2  Character:  Olive Hornby
Book:  2  Character:  Mudblood
Book:  2  Character:  Vampire
Book:  2  Character:  Weasley family
Book:  2  Character:  Veronica Smethley
Book:  2  Character:  Manager of Flourish and Blotts
Book:  2  Character:  Perkins
Book:  2  Character:  Horcrux
Book:  2  Character:  Armando Dippet
Book:  2  Ch

Book:  4  Character:  Stewart Ackerley
Book:  4  Character:  Archie Aymslowe
Book:  4  Character:  Summers
Book:  4  Character:  Pyotr Vulchanov
Book:  4  Character:  Blast-Ended Skrewt
Book:  4  Character:  Poliakoff
Book:  4  Character:  Wilkes
Book:  4  Character:  Unidentified Hufflepuff girl (III)
Book:  4  Character:  Winky
Book:  4  Character:  Bertha Jorkins
Book:  4  Character:  Eloise Midgen
Book:  4  Character:  Niffler
Book:  4  Character:  Kevin Whitby
Book:  4  Character:  Spider
Book:  4  Character:  Ludovic Bagman's father
Book:  4  Character:  Amos Diggory
Book:  4  Character:  Stebbins (Potter-era)
Book:  4  Character:  Gabrielle Delacour
Book:  4  Character:  Olympe Maxime
Book:  4  Character:  Mulciber (Marauder-era)
WAS REDIRECTED TO:  Mulciber II
Book:  4  Character:  Gilbert Wimple
Book:  4  Character:  Dot
Book:  4  Character:  Apollyon Pringle
Book:  4  Character:  Mary Riddle
Book:  4  Character:  Augustus Rookwood
Book:  4  Character:  Payne
Book:  4  Charact

Book:  6  Character:  Arkie Philpott
Book:  6  Character:  Hepzibah Smith
Book:  6  Character:  Mulciber (Riddle-era)
WAS REDIRECTED TO:  Mulciber I
Book:  6  Character:  Romilda Vane
Book:  6  Character:  Marvolo Gaunt
Book:  6  Character:  Eloise Midgen's father
Book:  6  Character:  Prime Minister's political opponent
Book:  6  Character:  Rupert Brookstanton
Book:  6  Character:  Abraxas Malfoy
Book:  6  Character:  Verity
Book:  6  Character:  Amycus Carrow
Book:  6  Character:  Avery (Riddle-era)
WAS REDIRECTED TO:  Avery I
Book:  6  Character:  Herbert Chorley
Book:  6  Character:  Libatius Borage
Book:  6  Character:  Demelza Robins
Book:  6  Character:  Prime Minister's niece
Book:  6  Character:  Muriel
Book:  6  Character:  Marcus Belby's father
Book:  6  Character:  Katie Bell's mother
Book:  6  Character:  Cormac McLaggen
Book:  6  Character:  Harper
Book:  6  Character:  Lestrange
Book:  6  Character:  Martha (disambiguation)
Book:  6  Character:  Octavius Pepper
Book:  6

In [8]:
df_wiki.head(3)

Unnamed: 0,wiki,book,alternative_wiki
0,Tufty,1,
1,Mirror of Erised,1,
2,Quirinus Quirrell's first mountain troll,1,
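
The download loop switches the working directory for every book. An alternative is to build each file path explicitly, sketched below. The helper names are hypothetical, and replacing `/` in titles is an added assumption to keep each title a single file name; the page text in the demo is a placeholder.

```python
import tempfile
from pathlib import Path

def page_path(root, book, title):
    """Build <root>/book<N>/<title>.txt, replacing '/' so the title stays one file name."""
    safe_title = title.replace('/', '_')
    return Path(root) / f'book{book}' / f'{safe_title}.txt'

def save_page(root, book, title, text):
    """Write the page text into its book folder, creating the folder if needed."""
    path = page_path(root, book, title)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding='utf-8')
    return path

# Demo in a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as root:
    saved = save_page(root, 2, 'Rogue bludger', '{{Object infobox}}')
    print(saved.relative_to(root))
    round_trip = saved.read_text(encoding='utf-8')
```

With explicit paths, an interrupted run cannot leave the notebook in the wrong working directory.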


**All the WikiPages are now stored. Next, we make sure the dataframe holds the correct information on alternative wikis:**

In [9]:
# If a redirect target already exists as its own row in the dataframe, we need to drop the duplicate

alternative_wikis = set(list(df_wiki.alternative_wiki))
df_wiki[df_wiki.wiki.isin(alternative_wikis)]

Unnamed: 0,wiki,book,alternative_wiki
356,Alastor Moody,4,


In [10]:
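# So we try to override the alternative_wiki link for this row.
# Note: this chained indexing writes to a temporary copy, so df_wiki itself is not changed
# (In [12] still shows an empty alternative_wiki); a df_wiki.loc[mask, column] assignment would be needed to persist it
df_wiki[df_wiki.wiki == 'Alastor Moody']['alternative_wiki'] = 'Alastor "Mad-Eye" Moody'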

#Finally we drop the row containing 'Alastor "Mad-Eye" Moody' in the wiki
df_wiki = df_wiki[df_wiki.wiki != 'Alastor "Mad-Eye" Moody']

In [11]:
len(df_wiki)

788

In [12]:
df_wiki[df_wiki.wiki == 'Alastor Moody']

Unnamed: 0,wiki,book,alternative_wiki
356,Alastor Moody,4,
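
The output above shows `alternative_wiki` still empty for this row. A boolean-mask selection such as `df[mask]` returns a copy, so assigning into it does not reach the original dataframe. A minimal sketch of an assignment that does persist, using `.loc` on a toy dataframe built for this example:

```python
import pandas as pd

# Toy dataframe with the same two columns as the notebook's df_wiki
toy = pd.DataFrame({'wiki': ['Alastor Moody', 'Tufty'],
                    'alternative_wiki': [None, None]})

# Select rows and column in a single .loc call so the write hits toy itself
mask = toy.wiki == 'Alastor Moody'
toy.loc[mask, 'alternative_wiki'] = 'Alastor "Mad-Eye" Moody'
print(toy)
```

If this were applied to `df_wiki`, note that the later cleaning cell opens the file named after `alternative_wiki` whenever it is set, so the stored file name would have to match.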


In [13]:
# Making sure we're in the wiki home directory

os.chdir(hd)

print(os.getcwd())

/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis


In [14]:
# Creating a new folder to store the dataframe and switching to it:
if not os.path.exists("dataframes"):
    os.mkdir("dataframes")
os.chdir(hd + "/dataframes")

# And fetching the files is done, saving the dataframe:
with open('original_wiki_df.pkl', 'wb') as f:
    pickle.dump(df_wiki, f)

---

# WikiPage DataFrame cleaning and sorting

_Now that we have all the files, we have to extract the valuable information from them. This is so we can build a network later._

In [15]:
# Start by fetching the dataframe from above and changing the directory:

os.chdir(hd)

# Fetching the original dataframe

with open('dataframes/original_wiki_df.pkl', 'rb') as f:
    df_wiki = pickle.load(f)

In [16]:
# Seeing if it loaded correctly 

df_wiki.head(3)

Unnamed: 0,wiki,book,alternative_wiki
0,Tufty,1,
1,Mirror of Erised,1,
2,Quirinus Quirrell's first mountain troll,1,


_Now, we loop through the DataFrame that contains information about the WikiPages, and open the files that correspond with the row. Then, we fetch the links that are contained in the page, and various attributes such as the Hogwarts House, family and others._
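
One way to collect the linked attributes so that every row starts from empty defaults is sketched below. It assumes the infobox parameters have already been read into a plain dict of raw wikitext values (the notebook uses mwparserfromhell for that step); the function names and sample values are hypothetical. Unlike the notebook, it strips display labels for every field, including blood and house.

```python
import re

LINK = re.compile(r"\[\[(.*?)\]\]")
LINK_FIELDS = ('blood', 'house', 'job', 'family', 'loyalty')

def link_targets(raw_value):
    """Return the link targets found in a raw parameter value, without display labels."""
    return [link.split('|')[0] for link in LINK.findall(raw_value)]

def extract_fields(params):
    """Map each field to its link targets; absent parameters yield an empty list."""
    return {name: link_targets(params.get(name, '')) for name in LINK_FIELDS}

# Made-up parameter values for illustration
sample_params = {
    'house': '[[Gryffindor]]',
    'family': '[[Arthur Weasley]] (father)<br>[[Molly Weasley|Molly]] (mother)',
}
fields = extract_fields(sample_params)
print(fields)
```

Since a fresh dict is built per page, a parameter that is absent on one page can never inherit the value found on the previous page.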

In [17]:
# Now we need to loop through all the files and extract information to add to the dataframe

# Suppress warning message for SettingWithCopyWarning
pd.options.mode.chained_assignment = None

folders = ['book1','book2','book3', 'book4', 'book5', 'book6', 'book7']

# Define the working directory
wd = hd

# Add columns to the dataframe for the additional information
df_wiki['text'] = None
df_wiki['links'] = None
df_wiki['house'] = None
df_wiki['type'] = None
df_wiki['blood'] = None
df_wiki['job'] = None
df_wiki['family'] = None
df_wiki['loyalties'] = None

# Definitions for later
wiki_links = list(df_wiki.wiki)
alternative_links = list(df_wiki.alternative_wiki[df_wiki.alternative_wiki.notna()])
all_wiki_links = wiki_links + alternative_links

for idx, row in df_wiki.iterrows():
    # Check if working directory is correct based on the next row - we want to read each character from their book's folder
    expected_path = wd + '/book' + str(row['book'])
    if os.getcwd() != expected_path:
        print("Setting working directory")
        print(expected_path)
        os.chdir(expected_path)

    # Check if there was an alternative link for this character
    file_name = row.wiki
    if row.alternative_wiki is not None:
        file_name = row.alternative_wiki

    # Opening the file containing the WikiPage
    with open(file_name + '.txt', "r") as file:
        text = file.read()

    # Print info so easier to see where loop is running
    #print("Book: ", row.book, " Character: ", file_name)

    # Finding the links in the entity's WikiPage
    links = re.findall(r"\[\[(.*?)\]\]", text)
    links_clean = [l.split('|')[0] for l in links if not l.startswith(('File', 'nl:', 'Category:', 'pl:', 'ru:', 'Harry Potter and the '))]
    
    # filter all the links based on characters in the dataframe
    wiki_links_clean = [l for l in links_clean if l in wiki_links]
    alternative_links_clean = [l for l in links_clean if l in alternative_links]

    # Clean up and overwrite link values where there are alternative wikis
    for i in range(len(alternative_links_clean)):
        alternative_links_clean[i] = df_wiki[df_wiki.alternative_wiki==alternative_links_clean[i]].wiki.item()

    # Combine both links into a single set
    clean_links = list(set(wiki_links_clean + alternative_links_clean))

    # Storing the WikiPage text and its links in the DataFrame
    df_wiki.at[idx, "text"] = text
    df_wiki.at[idx, "links"] = clean_links

    # We also want to extract more information from the text of each character.
    # Source for info: https://github.com/earwig/mwparserfromhell
    wikicode = mwparserfromhell.parse(text) # Convert to wikicode
    templates = wikicode.filter_templates() # Extract the templates of the wikicode

    # Checking whether the WikiPage is for an individual
    if len([t for t in templates if t.startswith('{{Individual infobox')]) > 0:
        
        # If the WikiPage is for an individual, we can start fetching information about the individual
        template = [t for t in templates if t.startswith('{{Individual infobox')][0] # Find the template containing the infobox

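        # Reset the per-row values so a missing parameter never reuses the previous row's value
        blood, house, job, family, loyalties = [], [], [], [], []

        # Extract the relevant information from the template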
        if template.has_param('blood'):
            blood = re.findall(r"\[\[(.*?)\]\]", str(template.get('blood').value))
            #print(blood)
        if template.has_param('nationality'):
            nationality = re.findall(r"\[\[(.*?)\]\]", str(template.get('nationality').value))
            nationality = [n.split('|')[0] for n in nationality]
            #print(nationality)
        if template.has_param('species'):
            species = re.findall(r"\[\[(.*?)\]\]", str(template.get('species').value))
            #print(species)
        if template.has_param('gender'):
            gender = [template.get('gender').value.strip().split('<')[0]]
            #print(gender)
        if template.has_param('hair'):
            hair = [template.get('hair').value.strip().split('<')[0]]
            #print(hair)
        if template.has_param('eyes'):
            eyes = [template.get('eyes').value.strip().split('<')[0]]
            #print(eyes)
        if template.has_param('skin'):
            skin = [template.get('skin').value.strip().split('<')[0]]
            #print(skin)
        
        # Finding whether the character has family ties
        if template.has_param('family'):
            family_raw = re.findall(r"\[\[(.*?)\]\]", str(template.get('family').value))
            family = [l.split('|')[0] for l in family_raw]

        if template.has_param('house'):
            house = re.findall(r"\[\[(.*?)\]\]", str(template.get('house').value))
            
        if template.has_param('job'):
            job_raw = re.findall(r"\[\[(.*?)\]\]", str(template.get('job').value))
            job = [l.split('|')[0] for l in job_raw]
            
        if template.has_param('loyalty'):
            loyalties_raw = re.findall(r"\[\[(.*?)\]\]", str(template.get('loyalty').value))
            loyalties = [l.split('|')[0] for l in loyalties_raw]

        # Other topics: boggart, patronus

        # Write the relevant values into the dataframe
        if len(house) > 0:
            df_wiki.at[idx, "house"] = house[0]
            
        if len(job) > 0:
            df_wiki.at[idx, "job"] = job[0]
            
        if len(family) > 0:
            df_wiki.at[idx, "family"] = family
            
        if len(loyalties) > 0:
            df_wiki.at[idx, "loyalties"] = loyalties
            
        if len(blood) > 0:
            df_wiki.at[idx, "blood"] = blood
        
        # Finally, we're declaring the character as an individual
        df_wiki.at[idx, "type"] = "Individual"
    

    # If the WikiPage is not for an individual, here we store what type the entity is    
    elif len([t for t in templates if t.startswith('{{Object infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Object"
        
    elif len([t for t in templates if t.startswith('{{Pet individual infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Pet"
        
    elif len([t for t in templates if t.startswith('{{Creature infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Creature"
        
    elif len([t for t in templates if t.startswith('{{Letter_infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Letter"      

    elif len([t for t in templates if t.startswith('{{Family infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Family"      

    elif len([t for t in templates if t.startswith('{{Location infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Location"
        
    elif len([t for t in templates if t.startswith('{{Battle infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Battle" 

    elif len([t for t in templates if t.startswith('{{Quidditch Team infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Quidditch Team" 
        
    elif len([t for t in templates if t.startswith('{{Organisation infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Organisation" 
        
    elif len([t for t in templates if t.startswith('{{Potion infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Potion" 
        
    elif len([t for t in templates if t.startswith('{{Plant infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Plant" 
        
    elif len([t for t in templates if t.startswith('{{Spell infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Spell" 
    
    elif len([t for t in templates if t.startswith('{{Event infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Event" 
        
    elif len([t for t in templates if t.startswith('{{Official post')]) > 0:
        df_wiki.at[idx, "type"] = "Official post" 
        
    elif len([t for t in templates if t.startswith('{{School infobox')]) > 0 or len([t for t in templates if t.startswith('{{School_infobox')]) > 0:
        df_wiki.at[idx, "type"] = "School" 
        
    elif len([t for t in templates if t.startswith('{{Horcrux infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Horcrux" 
        
    elif len([t for t in templates if t.startswith('{{Class infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Class" 
        
    elif len([t for t in templates if t.startswith('{{Book infobox')]) > 0:
        df_wiki.at[idx, "type"] = "Book" 
        
    else:
        df_wiki.at[idx, "type"] = "Unknown"
    
# reset wd
os.chdir(wd)


Setting working directory
/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis/book1
Setting working directory
/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis/book2
Setting working directory
/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis/book3
Setting working directory
/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis/book4
Setting working directory
/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis/book5
Setting working directory
/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis/book6
Setting working directory
/Users/teddi/Documents/DTU/FALL2020/Social Graphs & Interactions/Final Project/New segmented project/Wikis/book7
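
The long `elif` chain in the cell above can also be written as an ordered lookup table. The sketch below assumes the templates have been converted to plain strings first (for example with `str(t)`); the names `INFOBOX_TYPES` and `classify` are hypothetical. The order of the table matches the order of the checks in the cell, so the first match wins in the same way.

```python
# (template prefix, entity type), checked in order
INFOBOX_TYPES = [
    ('Individual infobox', 'Individual'),
    ('Object infobox', 'Object'),
    ('Pet individual infobox', 'Pet'),
    ('Creature infobox', 'Creature'),
    ('Letter_infobox', 'Letter'),
    ('Family infobox', 'Family'),
    ('Location infobox', 'Location'),
    ('Battle infobox', 'Battle'),
    ('Quidditch Team infobox', 'Quidditch Team'),
    ('Organisation infobox', 'Organisation'),
    ('Potion infobox', 'Potion'),
    ('Plant infobox', 'Plant'),
    ('Spell infobox', 'Spell'),
    ('Event infobox', 'Event'),
    ('Official post', 'Official post'),
    ('School infobox', 'School'),
    ('School_infobox', 'School'),
    ('Horcrux infobox', 'Horcrux'),
    ('Class infobox', 'Class'),
    ('Book infobox', 'Book'),
]

def classify(template_texts):
    """Return the type of the first table entry matched by any template, else 'Unknown'."""
    for prefix, entity_type in INFOBOX_TYPES:
        if any(text.startswith('{{' + prefix) for text in template_texts):
            return entity_type
    return 'Unknown'

print(classify(['{{Spoiler|PAS|WU}}', '{{Object infobox\n|name = Mirror}}']))
```

Adding a new entity type then only requires one more row in the table.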


**Blood status is often ambiguous. We assume that if one of the following values is mentioned for a row, it is true, checking them in the order in which they appear below:**

In [18]:
for idx, row in df_wiki.iterrows():
    if row.blood:
        blood = [x.lower() for x in row.blood]
        if 'non-magic people|muggle' in blood:
            df_wiki.at[idx, 'blood'] = "Muggle"
        elif 'muggle-born' in blood:
            df_wiki.at[idx, 'blood'] = "Muggle-Born"
        elif 'pure-blood' in blood:
            df_wiki.at[idx, 'blood'] = "Pure-Blood"
        elif 'half-blood' in blood:
            df_wiki.at[idx, 'blood'] = "Half-Blood"
        else:
            df_wiki.at[idx, 'blood'] = "Unknown"
    else:
        df_wiki.at[idx, 'blood'] = "Unknown"
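
The same priority order can be expressed as a small function that is easy to test in isolation. This is a sketch with hypothetical names (`BLOOD_LABELS`, `normalise_blood`); the needles and labels are the ones used in the cell above.

```python
# (lower-cased link text, stored label), checked in order
BLOOD_LABELS = [
    ('non-magic people|muggle', 'Muggle'),
    ('muggle-born', 'Muggle-Born'),
    ('pure-blood', 'Pure-Blood'),
    ('half-blood', 'Half-Blood'),
]

def normalise_blood(links):
    """Return the first matching blood status label, or 'Unknown'."""
    if not links:
        return 'Unknown'
    lowered = [link.lower() for link in links]
    for needle, label in BLOOD_LABELS:
        if needle in lowered:
            return label
    return 'Unknown'

print(normalise_blood(['Pure-blood', 'Half-blood']))
```

When several statuses are mentioned, the earliest entry in the table wins, which mirrors the `if`/`elif` order of the loop.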

---

In [20]:
# Let's see if the information was stored

df_wiki.head(3)

Unnamed: 0,wiki,book,alternative_wiki,text,links,house,type,blood,job,family,loyalties
0,Tufty,1,,{{Pet individual infobox\n|image = \n|name = T...,"[Harry Potter, Mr Tibbles, Dudley Dursley, Ara...",,Pet,Unknown,,,
1,Mirror of Erised,1,,{{Spoiler|PAS|WU}}\n{{Object infobox\n|name = ...,"[Argus Filch, Percival Dumbledore, Ronald Weas...",,Object,Unknown,,,
2,Quirinus Quirrell's first mountain troll,1,,{{Youmay|the troll brought in on [[Hallowe'en]...,"[Harry Potter, Parvati Patil, Minerva McGonaga...",,Pet,Unknown,,,


# Success!

_The different attributes are now stored in the dataframe, and we can start analysing the different entities fetched from the wiki._

**Storing this version of the DataFrame:**

In [21]:
os.chdir(hd)

# Creating a new folder to store the dataframe and switching to it:
if not os.path.exists("dataframes"):
    os.mkdir("dataframes")
os.chdir(hd + "/dataframes")

# Saving the cleaned dataframe:
with open('cleaned_wiki_df.pkl', 'wb') as f:
    pickle.dump(df_wiki, f)

---