### CUNY Data 620 - Web Analytics, Summer 2020
Attempt to Generate Bipartite network from Wikipedia page

### Data Set
* **Source**: [Wiki page for official languages by country/territory](https://en.wikipedia.org/wiki/List_of_official_languages_by_country_and_territory)
* **Format**: Wikipedia table scraped for information
* **Description**: A table of countries and each language associated with that country whther in an official capacity or simly widely spoken. 
* **Nodes**: 207 countries, 303 languages
* **Edges**: 665 pairings


### Importing Packages

In [1]:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import networkx as nx
from networkx.algorithms import bipartite

### Scraping Data

In [2]:
website_url = requests.get("https://en.wikipedia.org/wiki/List_of_official_languages_by_country_and_territory").text

soup = BeautifulSoup(website_url, 'lxml')
#print(soup.prettify())
lang_table = soup.find('table',{'class':'wikitable sortable'})

### Custom Functions

In [3]:
### Identify the name of the item within the XML line in question ###
def find_names(x):
    pattern = ">(\S+)</a"
    word = str(x)
    start = [m.start() for m in re.finditer(pattern, word)]
    end = [m.end() for m in re.finditer(pattern, word)]
    result = ''
    for i in range(0,len(start)):
        temp = x[(start[i]+1):(end[i]-3)]
        
        result = result + (temp)
    return (result)

### Determine if there are bullets within the table cell and pull languages for each ###
### If not, simply add the language to the list, ignoring "names" shorter than 3 characters as they are likely erroneous ###
def lists_of_langs(td,col,typ,langs):
    if str(td).find("<ul>") > 0:
        lis = td.findAll('li')
        for li in lis:
            lang = li.get_text()
            add_lang(country,lang,typ,langs)
    else:
        lang = td.get_text()
        if len(lang)>3:
            add_lang(country,lang,typ,langs)
    
### Add a row to the language dataframe for the country, language and language type supplied ###
def add_lang(count,lang,typ,lis):
    lis.append([count, lang, typ])
    
### Remove footnotes, parentheses and trailing spaces from text ###
def clean_name(x):
    if x.endswith("]"):
        x = re.sub(r'\[.*\]', '', x)
    start = [m.start() for m in re.finditer("\(", x)]
    if len(start) > 0:
        x = x[:start[0]]
    if x.endswith("%"):
        x = re.sub(r'\d+\%', '', x)
    return x.strip()
    

Create an empty list and run through each 'td' of the XML code. Each 'td' coincides with a cell in the table, so we can predict which column each one is in and determine the type of language found.

In [4]:
langs = []
trs = lang_table.findAll('tr')
counter = 1
for tr in trs:
    tds = tr.findAll('td')
    col = 0
    for td in tds:
        if col == 0:
            country = td.get_text()
            #country = find_names(str(td))
        elif col == 1:
            typ = "Official"
            lists_of_langs(td,col,typ,langs)
        elif col == 2:
            typ = "Regional"
            lists_of_langs(td,col,typ,langs)
        elif col == 3:
            typ = "Minority"
            lists_of_langs(td,col,typ,langs)
        elif col == 4:
            typ = "National"
            lists_of_langs(td,col,typ,langs)
        elif col == 1:
            typ = "Widely Spoken"
            lists_of_langs(td,col,typ,langs)
        col += 1
    counter += 1
    

Convert the list into a pandas dataframe, name the columns and clean up each one. 

In [5]:
col_names = ['Country','Language','Type']
langs_df = pd.DataFrame(langs, columns = col_names)
Language = set(langs_df["Language"])
langs_df['Country'] = langs_df['Country'].map(clean_name)
langs_df['Country'] = langs_df['Country'].map(clean_name)

langs_df['Language'] = langs_df['Language'].map(clean_name)
langs_df

Unnamed: 0,Country,Language,Type
0,Abkhazia,Abkhaz,Official
1,Abkhazia,Russian,Official
2,Abkhazia,Georgian,Minority
3,Abkhazia,Abkhaz,National
4,Afghanistan,Pashto,Official
...,...,...,...
660,Zimbabwe,Sotho,Official
661,Zimbabwe,Tonga,Official
662,Zimbabwe,Tswana,Official
663,Zimbabwe,Venda,Official


Assign weights to each language based on the type. 

In [7]:
### Define an edge weight that will be associated with each language type
def type_weight(row):
    if row['Type'] == "Official":
        return 30
    elif row['Type'] == "Regional":
        return 25
    elif row['Type'] == "Minority":
        return 20
    elif row['Type'] == "National":
        return 15
    elif row['Type'] == "Widely Spoken":
        return 10

In [8]:
langs_df["Weights"] = langs_df.apply (type_weight,axis =1)
#Rename languages with identical names to countries
langs_df.loc[langs_df['Language'] == "Kiribati", 'Language'] = "Kiribati Language"
langs_df.loc[langs_df['Language'] == "Tonga", 'Language'] = "Tonga Language"
langs_df

Unnamed: 0,Country,Language,Type,Weights
0,Abkhazia,Abkhaz,Official,30
1,Abkhazia,Russian,Official,30
2,Abkhazia,Georgian,Minority,20
3,Abkhazia,Abkhaz,National,15
4,Afghanistan,Pashto,Official,30
...,...,...,...,...
660,Zimbabwe,Sotho,Official,30
661,Zimbabwe,Tonga Language,Official,30
662,Zimbabwe,Tswana,Official,30
663,Zimbabwe,Venda,Official,30


In [16]:
langs_df['Country'] = langs_df['Country'].map(clean_name)
langs_df['Language'] = langs_df['Language'].map(clean_name)
Countries = list(set(langs_df["Country"]))
Languages = list(set(langs_df["Language"]))
#Countries

Create an empty list 'Edges' and populate it by going down the line of the dataframe and generating tuples containing the Country, Language and Weight. 

In [14]:
Edges = []
for i in range(0,len(langs_df)):
    tup = (langs_df.loc[i]['Country'],langs_df.loc[i]['Language'],langs_df.loc[i]['Weights'])
    Edges.append(tup)
len(Edges)

665

Attempt to create a bipartite network from this data. 

In [11]:
nCountr=len(Countries)
nLangs=len(Languages)
g=nx.generators.empty_graph(nCountr+nLangs)
g.clear()
g.name="World Languages by Country"
g.add_weighted_edges_from(Edges)

In [12]:
nx.is_bipartite(g)

True