### Text Classification 
Goal: to gain experience with natural language processing libraries like nltk and spacy, and simple prediction models in sklearn

Project Scope: 
This project aims to predict an individual's nationality based on information from her/his Wikipedia bio. One challenge is that the Wikipedia training set nationality fields can be noisy, so one goal is to use nltk or spacy to clean the target field (rather than using manual techniques).


In [29]:
import numpy as np
import pandas as pd
from string import punctuation

import nltk
# from nltk.stem import SnowballStemmer  #using lemmatization instead
# sbs = SnowballStemmer("english")
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# !pip install spacy
# !python -m spacy download en_core_web_sm
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
import time

In [2]:
RANDOM_SEED = 655

### Initial Data Processing

In [84]:
nationality_df = pd.read_csv('bio_nationality.tsv.gz', sep='\t', compression='gzip', index_col=0)
nationality_df.dropna(inplace = True)  #### skip empty entries (drop 11)
print(f"The dataset has {nationality_df.shape[0]} records")
nationality_df.sample(3)

The dataset has 319358 records


Unnamed: 0,bio,nationality
266787,Markus Irle (born 5 February 1976) is an Austr...,austrian
63181,Dino Domenico Natali (better known as Dino Nat...,"united states, american"
313518,Life\nIn 1924 Bausch was co-founder of the Pro...,german


In [91]:
round(nationality_df.nationality.value_counts().head(10)/nationality_df.shape[0]*100,2)

nationality
american                   13.51
british                     5.21
indian                      3.37
united states, american     3.07
australian                  2.59
french                      2.31
german                      2.08
italian                     1.69
japanese                    1.64
canadian                    1.59
Name: count, dtype: float64

Exploratory observations:
* there is some inconsistency in the way nationalities are represented, e.g., "american" vs "united states, american".
* some nationality data is garbled or lists more than one nationality, e.g., "norway, norwegian and germany, german" or "rus) (pol"
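
One quick way to gauge how widespread these issues are is to flag every entry that contains anything other than plain lowercase words. The snippet below is a minimal, pure-Python sketch on a handful of made-up values in the style of those above; `sample_values` and `looks_noisy` are hypothetical names that do not appear elsewhere in this notebook.

```python
import re
from collections import Counter

# Hypothetical sample values, modelled on entries seen in the nationality column
sample_values = [
    "american",
    "united states, american",
    "norway, norwegian and germany, german",
    "rus) (pol",
    "south african",
]

def looks_noisy(value):
    """Flag values that contain anything other than lowercase words separated by single spaces."""
    return re.fullmatch(r"[a-z]+( [a-z]+)*", value) is None

noise_counts = Counter(looks_noisy(v) for v in sample_values)
noisy_share = noise_counts[True] / len(sample_values) * 100
print(f"{noisy_share:.0f}% of the sample values look noisy")
```

On the real data, the same check could be applied with `nationality_df.nationality.apply(looks_noisy).mean()`.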

In [86]:
# nationality_df.nationality.value_counts().iloc[61:80]

Three techniques are explored to clean the nationality field:
1. manual transformations based on some of the main types of observed issues
2. using spacy named entity recognition directly
3. using a combination of approaches 1 and 2

##### Approach 1

In [96]:
def nationality_cleanup_basic(x, basic = True):
    """Manually clean the nationality field based on the most common types of errors seen.
    This is a brute-force approach, and any nuances in the nationality field are lost."""

    nationalities = x.split(",")
    if len(nationalities) == 1:
        cleaned = nationalities[0].strip()
    elif len(nationalities) == 2:  # e.g. "united states, american": keep the second part
        cleaned = nationalities[1].strip()
    else:  # three or more parts: too ambiguous for this rule
        cleaned = None

    if cleaned in ["us", "united states", "usa", "U.S.A."]:
        cleaned = "american"
    elif cleaned in ["united kingdom", "uk", "english", "eng"]:
        cleaned = "british"

    # With basic = False, fall back to the raw string so a later step can still process it
    if not basic and cleaned is None:
        return x
    return cleaned

start = time.time()
nationality_df["nationality_basic"] = nationality_df.nationality.apply(lambda x: nationality_cleanup_basic(x, basic = True))
print(f"this approach was unable to identify nationality for {round(nationality_df.nationality_basic.isna().sum()/nationality_df.shape[0]*100,2)}% of the records")
nationality_df.sample(5)

this approach was unable to identify nationality for 1.02% of the records


Unnamed: 0,bio,nationality,nationality_basic
85404,"Jennifer Rardin (April 28, 1965 – September 20...",american,american
191167,"Thomas Joris Paul Alizier (born June 20, 1990)...","french people, french",french
264169,Marie-Thérèse Abena Ondoa (nee '''Obama''') is...,cameroonian,cameroonian
98679,Political career\nMnqasela was initially an Af...,south african,south african
242017,Biography\nAlexandra Vela was born to Ecuadori...,flag: ecuador,flag: ecuador


In [92]:
round(nationality_df.nationality_basic.value_counts().head(10)/ nationality_df.shape[0]*100, 2)

nationality_basic
american      21.11
british        9.23
indian         4.01
french         3.04
german         2.75
canadian       2.71
australian     2.70
italian        2.30
japanese       1.95
spanish        1.31
Name: count, dtype: float64

Simple cleanup improves the nationality groupings considerably (for example, "american" rises from 13.51% to 21.11% of the records) and fails on only about 1% of the records.

However, this approach requires knowing the common types of errors in the nationality field ahead of time, which is not always possible. Therefore, a second approach is explored that extracts nationalities directly using spacy.

##### Approach 2

In [99]:
start = time.time()
def nationality_cleanup_spacy(x):
    """Extract a nationality with spaCy. When several are found, only the first one (in text order) is returned."""

    doc = nlp(x)
    if len(doc) == 1:  # a single token is assumed to be the nationality itself, so it is used directly
        return doc[0].lemma_

    # Otherwise, extract entities tagged NORP ("nationalities or religious or political groups")
    # and fall back to GPE entities (countries, cities, states) if no NORP entity is found.
    # https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    # Lists (rather than sets) keep the entities in text order, so index 0 really is the first one
    nationalities = [lemmatizer.lemmatize(text) for text, label in entities if label == "NORP"]
    countries = [lemmatizer.lemmatize(text) for text, label in entities if label == "GPE"]

    if nationalities:
        return nationalities[0]
    if countries:  # no nationality could be extracted, so try country names instead
        return countries[0]
    return None

nationality_df["nationality_spacy_direct"] = nationality_df.nationality.apply(lambda x: nationality_cleanup_spacy(x))
print(f"spacy based cleanup takes ~{round((time.time() - start)/60,2)}m to run")
print(f"this approach was unable to identify nationality for {round(nationality_df.nationality_spacy_direct.isna().sum()/nationality_df.shape[0]*100,2)}% records")
nationality_df.head()

spacy based cleanup takes ~24.82m to run
this approach was unable to identify nationality for 3.17% records


Unnamed: 0,bio,nationality,nationality_basic,nationality_spacy_direct
0,Alain Connes (born 1 April 1947) is a French m...,french,french,french
1,Life\n=== Early life ===\nSchopenhauer's birth...,german,german,german
2,Life and career\nAlfred Nobel at a young age i...,swedish,swedish,swedish
3,"Early life\nAlfred Vogt (both ""Elton"" and ""van...",canadian,canadian,canadian
4,Alfons Maria Jakob (2 July 1884 in Aschaffenbu...,german,german,german


In [102]:
round(nationality_df.nationality_spacy_direct.value_counts().head(10)/ nationality_df.shape[0]*100, 2)

nationality_spacy_direct
american         18.89
british           7.24
indian            4.15
french            3.28
australian        3.14
german            2.98
canadian          2.96
italian           2.47
japanese          2.01
united states     1.67
Name: count, dtype: float64

Spacy fails to identify nationalities for about 3% of the records (compared to 1% via direct transformations), and the results obtained from spacy-based extraction also contain some noise (e.g., "american" and "united states" are treated as separate entities). In general, though, the results are comparable, especially considering that spacy did not require any prior knowledge of the types of noise in the nationality column and was able to extract the relevant nationalities independently.
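
The ~25 minute runtime is likely dominated by calling the spaCy pipeline once per row, even though many rows share the same nationality string (the ten most common raw values alone cover over a third of the records). One possible mitigation, sketched below, is to run the expensive function once per distinct value and map the results back. `slow_extract` is a hypothetical stand-in for `nationality_cleanup_spacy`, so the snippet runs without spaCy; it only records how often it is called.

```python
# Hypothetical stand-in for nationality_cleanup_spacy; it records each call it receives
calls = []

def slow_extract(value):
    calls.append(value)
    return value.split(",")[-1].strip()

raw_values = ["american", "united states, american", "american", "german", "american"]

# Run the expensive function once per distinct value, then map results back to every row
lookup = {value: slow_extract(value) for value in set(raw_values)}
cleaned = [lookup[value] for value in raw_values]

print(cleaned)
print(f"{len(calls)} calls for {len(raw_values)} rows")
```

With pandas, the same idea could be expressed by building `lookup` from `nationality_df.nationality.unique()` and then calling `nationality_df.nationality.map(lookup)`.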

##### Approach 3
Both techniques are combined in a third and final approach, where we first manually correct any obvious/known issues (such as those listed in the exploratory observations above) and then apply spacy. This approach identifies the nationality for about 98% of the records.
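
The control flow of this combined approach can be sketched as follows. Both helpers are simplified, hypothetical stand-ins: `rule_based_step` imitates `nationality_cleanup_basic(x, basic = False)`, and `entity_step` uses a small keyword list to imitate what the spaCy step might return. The alias table and vocabulary are assumptions made up for this illustration.

```python
# Hypothetical alias table and vocabulary, made up for this illustration
KNOWN_ALIASES = {"united states": "american", "usa": "american", "uk": "british"}
KNOWN_NATIONALITIES = ["american", "british", "german", "norwegian"]

def rule_based_step(value):
    """Simplified stand-in for nationality_cleanup_basic(x, basic=False)."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) > 2:
        return value  # too many parts: pass the raw string on unchanged
    cleaned = parts[-1]
    return KNOWN_ALIASES.get(cleaned, cleaned)

def entity_step(value):
    """Simplified stand-in for the spaCy step: earliest known nationality in the text, else None."""
    hits = [(value.find(name), name) for name in KNOWN_NATIONALITIES if name in value]
    return min(hits)[1] if hits else None

def combined_cleanup(value):
    # Manual rules first, then entity extraction on whatever they produce
    return entity_step(rule_based_step(value))

for raw in ["united states, american", "usa", "norway, norwegian and germany, german", "rus) (pol"]:
    print(f"{raw!r} -> {combined_cleanup(raw)!r}")
```

The point of the ordering is that the manual rules resolve the cheap, well-understood cases, while strings they cannot handle are passed through untouched so the extraction step still gets a chance at them.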

In [105]:
start = time.time()
# First apply the manual rules (keeping the raw string where they fail), then run spaCy on the result
nationality_df["nationality_basic2"] = nationality_df.nationality.apply(lambda x: nationality_cleanup_basic(x, basic = False))
# nationality_df.nationality_basic2.isna().sum()
nationality_df["nationality_spacy"] = nationality_df.nationality_basic2.apply(lambda x: nationality_cleanup_spacy(x))
print(f"spacy based cleanup takes ~{round((time.time() - start)/60,2)}m to run")
print(f"this approach was unable to identify nationality for {round(nationality_df.nationality_spacy.isna().sum()/nationality_df.shape[0]*100,2)}% records")
# nationality_df.to_csv("nationality_df.csv")
nationality_df.head()

spacy based cleanup takes ~55.51m to run
this approach was unable to identify nationality for 1.99% records


Unnamed: 0,bio,nationality,nationality_basic,nationality_spacy_direct,nationality_basic2,nationality_spacy
0,Alain Connes (born 1 April 1947) is a French m...,french,french,french,french,french
1,Life\n=== Early life ===\nSchopenhauer's birth...,german,german,german,german,german
2,Life and career\nAlfred Nobel at a young age i...,swedish,swedish,swedish,swedish,swedish
3,"Early life\nAlfred Vogt (both ""Elton"" and ""van...",canadian,canadian,canadian,canadian,canadian
4,Alfons Maria Jakob (2 July 1884 in Aschaffenbu...,german,german,german,german,german


In [106]:
round(nationality_df.nationality_spacy.value_counts().head(10)/ nationality_df.shape[0]*100, 2)

nationality_spacy
american      21.62
british        9.53
indian         4.14
french         3.27
australian     3.11
german         2.97
canadian       2.93
italian        2.46
japanese       2.00
spanish        1.39
Name: count, dtype: float64

In [119]:
nationalities_freq = round(nationality_df.nationality_spacy.value_counts()/nationality_df.shape[0]*100, 2)
# keep only nationalities that make up more than 1% of the dataset
nationalities_freq_keep = nationalities_freq[nationalities_freq > 1]
nationalities_freq_keep

nationality_spacy
american      21.62
british        9.53
indian         4.14
french         3.27
australian     3.11
german         2.97
canadian       2.93
italian        2.46
japanese       2.00
spanish        1.39
irish          1.30
russian        1.30
mexican        1.25
polish         1.20
dutch          1.15
norwegian      1.12
pakistani      1.11
Name: count, dtype: float64
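
The thresholding step above can be illustrated on a tiny, made-up label list. This is only a sketch: `labels` and `min_share_pct` are hypothetical, and the threshold is set far higher than the 1% used above so that it has a visible effect on ten records.

```python
from collections import Counter

# Made-up labels; the real input would be the nationality_spacy column
labels = ["american"] * 6 + ["british"] * 3 + ["maltese"]
min_share_pct = 15  # hypothetical threshold chosen to suit this tiny sample

counts = Counter(labels)
shares = {label: count / len(labels) * 100 for label, count in counts.items()}
keep = {label for label, share in shares.items() if share > min_share_pct}

# Restrict the records to the retained classes
filtered = [label for label in labels if label in keep]
print(sorted(keep), len(filtered))
```

On the DataFrame, the equivalent restriction could be written as `nationality_df[nationality_df.nationality_spacy.isin(nationalities_freq_keep.index)]`.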

In [97]:
# doc= nlp("He is an americans from America")#. The cats are blue")
# # [(X.text, X.label_) for X in doc.ents]
# [(lemmatizer.lemmatize(X.text), X.label_) for X in doc.ents]
# # {X[0].lemma_ for X in NORP_GPE_extract if X[1] == "NORP"}



# sbs.stem("american")
x = "american (naturalized 1959)"  # input for the NER check at the end of this cell
# x = x.replace("hungarians, hungarian")
# print(x)
# token_list = [token for token in nltk.word_tokenize(x) if token not in punctuation and token not in [
#         "flag", "people", "name", "republic", "of", "democratic", "nationality", "law", "flagicon"]]

# # token_list = {sbs.stem(i) for i in token_list}
# token_list = " ".join(token_list)
# print(token_list)

doc = nlp(x) #"american, american") #token_list) #("myanmar, burmese")
[(X.text, X.label_) for X in doc.ents]

# token_list = [token for token in nltk.word_tokenize("myanmar, burmese") if token not in punctuation and token not in [
#         "flag", "people", "name", "republic", "of", "democratic", "nationality", "law", "flagicon"]]

# # token_list = {sbs.stem(i) for i in token_list}
# token_list = " ".join(token_list)
# token_list

# # doc = nlp(token_list) #("myanmar, burmese")
# # [(X.text, X.label_) for X in doc.ents]

[('american (naturalized 1959', 'ORG')]

In [87]:
# nationality_df.head()

Unnamed: 0,bio,nationality,nationality_clean,nationality_clean2
0,Alain Connes (born 1 April 1947) is a French m...,french,french,{french}
1,Life\n=== Early life ===\nSchopenhauer's birth...,german,german,{german}
2,Life and career\nAlfred Nobel at a young age i...,swedish,swedish,{swedish}
3,"Early life\nAlfred Vogt (both ""Elton"" and ""van...",canadian,canadian,{canadian}
4,Alfons Maria Jakob (2 July 1884 in Aschaffenbu...,german,german,{german}


In [None]:
# nationality_df[nationality_df.nationality_basic.isna()].shape[0]/nationality_df.shape[0]
# nationality_df[nationality_df.nationality_clean3.isna()].sample(10)

In [None]:
# nationality_df = pd.read_csv("nationality_df.csv")

In [None]:
# def nationality_cleanup_basic(x):
#     "some basic clean up for the nationality field"
    
#     ##neither nltk nor spacy lemmatizers or stemmers worked so it appears nationality needs to be cleaned up manually
#     x = x.replace("united states", "american")
#     x = x.replace("u.s.", "american")
#     x = x.replace("usa", "american")
#     x = x.replace("united states of america", "american")
#     x = x.replace("united kingdom", "british")
#     x = x.replace("england", "british")
#     x = x.replace("english", "british")
#     # x = x.replace("india", "indian")
#     x = x.replace("france", "french")
#     x = x.replace("belgium", "belgian")
#     x = x.replace("jpn", "japanese")
#     x = x.replace("netherlands", "dutch")
#     x = x.replace("switzerland", "swiss")
#     x = x.replace("puerto ric", "puertoric")
#     # x = x.replace("south africa", "southafrica")
#     # x = x.replace("new zealand", "newzealand")
#     # x = x.replace("north korea", "northkorea")
#     # x = x.replace("south korea", "southkorea")
    
#     token_list = [token for token in nltk.word_tokenize(x) if token not in punctuation and token not in [
#         "flag", "people", "name", "republic", "of", "democratic", "nationality", "law", "flagicon"]]

#     return " ".join(token_list)

# nationality_df["nationality_basic"] = nationality_df.nationality.apply(lambda x: nationality_cleanup_basic(x))
# nationality_df.head(10)