# Wednesday Code Challenge

Below is starter code, similar to yesterday's code challenge.

In [1]:
import pandas as pd
import json
import re
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


nlp = spacy.load("en_core_web_lg")
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS


with open('documents.json') as f:
    df = pd.DataFrame(json.load(f)).T.drop(columns = ['emails', 'institutions', 'people', 'places'])

df.head()

Unnamed: 0,contents,filename
Navigation to Small Bodies,"See discussions, stats, and author profiles fo...",txt_files/Navigation to Small Bodies.txt
ASTRONOMICAL ENGINEERING,ASTRONOMICAL ENGINEERING: A STRATEGY FOR MODIF...,txt_files/ASTRONOMICAL ENGINEERING.txt
Phase II of the Main Belt Asteroid Spectrosopic Survey,"Icarus 158, 146–177 (2002) doi:10.1006/icar.20...",txt_files/Phase II of the Main Belt Asteroid S...
Devlopment of Xenon Hall Thrusters,NASA/CR--2004-213099https://ntrs.nasa.gov/sear...,txt_files/Devlopment of Xenon Hall Thrusters.txt
Mine planning for Asteroid Ore Bodies,Space Resources Roundtable II (2000)7030.pdfMI...,txt_files/Mine planning for Asteroid Ore Bodie...


---

In [2]:
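def tokenize(x):
    # Lowercase, then delete every character that is not a letter, digit, or space.
    # Note: the second ^ inside the class is a literal caret, so '^' is kept as well.
    text = x.lower()
    text = re.sub(r'[^a-zA-Z ^0-9]', '', str(text))
    return text.split()


def spacy_lemmatize(x):
    # Only the tokenizer runs here (not the full pipeline); lemma_ is read from each token.
    doc = nlp.tokenizer(x)
    return [token.lemma_ for token in doc]


def remove_stopwords(tokens):
    # Drop spaCy stop words and return the remaining tokens as one space-joined string.
    cleaned_tokens = []
    for token in tokens:
        if token not in spacy_stopwords:
            cleaned_tokens.append(token)
    return ' '.join(cleaned_tokens)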

In [3]:
df['tokens'] = df['contents'].apply(lambda x: tokenize(x))
df['tokens'] = df['tokens'].apply(lambda x: spacy_lemmatize(' '.join(x)))
df['tokens'] = df['tokens'].apply(lambda x: remove_stopwords(x))

In [4]:
df['tokens'].head()

Navigation to Small Bodies                                discussion stats author profile publication ht...
ASTRONOMICAL ENGINEERING                                  astronomical engineer strategy modify planetar...
Phase II of the Main Belt Asteroid Spectrosopic Survey    icarus 158 146177 2002 doi101006icar20026856ph...
Devlopment of Xenon Hall Thrusters                        nasacr2004213099httpsntrsnasagovsearchjspr2004...
Mine planning for Asteroid Ore Bodies                     space resource roundtable ii 20007030pdfmine p...
Name: tokens, dtype: object
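
The fused tokens visible in the output above (for example `doi101006icar20026856ph...`) follow from `tokenize` deleting punctuation rather than replacing it with a space. The sketch below reproduces that effect on a made-up sample string and compares it with a hypothetical variant, `tokenize_spaced`, which is not part of the notebook and substitutes a space instead.

```python
import re


def tokenize(x):
    # Same logic as the notebook's tokenize: lowercase, then delete every
    # character outside the class (letters, digits, space, literal caret).
    text = x.lower()
    text = re.sub(r'[^a-zA-Z ^0-9]', '', str(text))
    return text.split()


def tokenize_spaced(x):
    # Hypothetical variant (an assumption, not the notebook's code):
    # replace each unwanted character with a space so neighbours stay apart.
    text = re.sub(r'[^a-z0-9 ]', ' ', str(x).lower())
    return text.split()


sample = "Icarus 158, doi:10.1006/icar.2002 (N^2 units)"  # made-up sample text
fused = tokenize(sample)
spaced = tokenize_spaced(sample)
print(fused)
print(spaced)
```

Whether the spaced variant is preferable depends on the downstream step: it keeps words separate but also splits identifiers such as DOIs into several short numeric tokens.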

---

## 1. Create a document-term matrix from the collection of papers using TF-IDF.

In [None]:
tfidf = TfidfVectorizer(stop_words='english', max_features = 5000)

dtm = tfidf.fit_transform(df['tokens'])

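# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead.
docs = pd.DataFrame(dtm.todense(), columns = tfidf.get_feature_names_out())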

docs.head()

In [6]:
docs.index = df.index

In [None]:
docs.head()

---

## 2. Explain what TF-IDF is.

Term frequency (TF), stated simply, is the number of times a term appears in a document divided by the total number of terms in that document.
Inverse document frequency (IDF) is the log of the total number of documents divided by the number of documents that contain the term. TF-IDF is the product of these two heuristics.
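
As a worked illustration of the definition above, the following sketch computes TF, IDF, and their product by hand on a three-document toy corpus. The corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. It uses the plain log(N / n_w) form, which is not the same as scikit-learn's default smoothed and normalized weighting.

```python
import math

# Toy corpus (hypothetical): each document is already a list of tokens.
corpus = [
    ["asteroid", "mining", "asteroid", "ore"],
    ["asteroid", "navigation", "orbit"],
    ["asteroid", "xenon", "thruster", "thruster"],
]


def tf(term, doc):
    # Term frequency: occurrences of the term divided by document length.
    return doc.count(term) / len(doc)


def idf(term, docs):
    # Inverse document frequency: log(N / n_w), where n_w is the number of
    # documents containing the term (assumed to be at least 1 here).
    n_w = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_w)


def tf_idf(term, doc, docs):
    # TF-IDF is the product of the two quantities.
    return tf(term, doc) * idf(term, docs)


# Rank the distinct terms of the last document by their TF-IDF weight.
last_doc = corpus[2]
ranking = sorted(set(last_doc), key=lambda t: tf_idf(t, last_doc, corpus), reverse=True)
print(ranking)
```

In this toy corpus "thruster" ranks first in the last document because it is both frequent there and absent elsewhere, while "asteroid" ranks last because it appears in every document.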

### Excerpt From: Delip Rao and Brian McMahan. “Natural Language Processing with PyTorch.”

Consider a collection of patent documents. You would expect most of them to contain words like claim, system, method, procedure, and so on, often repeated multiple times. The TF representation weights words proportionally to their frequency. However, common words such as “claim” do not add anything to our understanding of a specific patent. Conversely, if a rare word (such as “tetrafluoroethylene”) occurs less frequently but is quite likely to be indicative of the nature of the patent document, we would want to give it a larger weight in our representation. The Inverse-Document-Frequency (IDF) is a heuristic to do exactly that.

The IDF representation penalizes common tokens and rewards rare tokens in the vector representation. The IDF(w) of a token w is defined with respect to a corpus as:

IDF(w) = log(N/n_w)

where n_w is the number of documents containing the word w and N is the total number of documents. The TF-IDF score is simply the product TF(w) * IDF(w). First, notice how if there is a very common word that occurs in all documents (i.e., n_w = N), IDF(w) is 0 and the TF-IDF score is 0, thereby completely penalizing that term. Second, if a term occurs very rarely, perhaps in only one document, the IDF will be the maximum possible value, log N. Example 1-2 shows how to generate a TF-IDF representation of a list of English sentences using scikit-learn.

In deep learning, it is rare to see inputs encoded using heuristic representations like TF-IDF because the goal is to learn a representation. Often, we start with a one-hot encoding using integer indices and a special “embedding lookup” layer to construct inputs to the neural network. In later chapters, we present several examples of doing this.
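
The excerpt's two boundary cases (a word in every document, a word in exactly one) can be checked numerically. The sketch below also contrasts the textbook formula with the smoothed form that scikit-learn's `TfidfVectorizer` applies by default, ln((1 + N) / (1 + n_w)) + 1, under which a word present in every document keeps a weight of 1 rather than 0. The function names and the corpus size N = 100 are made up for illustration.

```python
import math


def idf_textbook(n_docs, n_with_term):
    # IDF(w) = log(N / n_w), as defined in the excerpt.
    return math.log(n_docs / n_with_term)


def idf_smoothed(n_docs, n_with_term):
    # Smoothed form: ln((1 + N) / (1 + n_w)) + 1.
    return math.log((1 + n_docs) / (1 + n_with_term)) + 1


N = 100  # hypothetical corpus size

everywhere = idf_textbook(N, N)  # word that occurs in all documents
rare = idf_textbook(N, 1)        # word that occurs in a single document
print(everywhere, rare)

# The smoothed form never reaches zero, so very common words are damped
# rather than removed entirely.
print(idf_smoothed(N, N), idf_smoothed(N, 1))
```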


---

## 3. Using named entity extraction, extract all the people, geographic locations, and academic/industry institutions from the contents of each paper.

### Store each of these values in a new column.

#### Hint: Check out the spaCy documentation for information regarding named entity extraction

The results will be far from perfect, and the extraction could take a few minutes to run.

In [8]:
def extract_people(contents):
    # Run the full spaCy pipeline and keep entities labeled PERSON.
    doc = nlp(contents)
    return [entity.text for entity in doc.ents if entity.label_ == 'PERSON']


def extract_places(contents):
    # GPE covers geopolitical entities: countries, cities, and states.
    doc = nlp(contents)
    return [entity.text for entity in doc.ents if entity.label_ == 'GPE']


def extract_institutions(contents):
    # ORG covers companies, agencies, and institutions.
    doc = nlp(contents)
    return [entity.text for entity in doc.ents if entity.label_ == 'ORG']

In [9]:
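# Note: NER runs on the cleaned, lowercased 'tokens' column here, as the output below reflects.
# The task asks for the raw 'contents'; spaCy's NER generally relies on capitalization and
# punctuation, so df['contents'] is likely to give cleaner entities.
df['people'] = df['tokens'].apply(lambda x : extract_people(x))
df['places'] = df['tokens'].apply(lambda x : extract_places(x))
df['institutions'] = df['tokens'].apply(lambda x : extract_institutions(x))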

In [10]:
df.head()

Unnamed: 0,contents,filename,tokens,people,places,institutions
Navigation to Small Bodies,"See discussions, stats, and author profiles fo...",txt_files/Navigation to Small Bodies.txt,discussion stats author profile publication ht...,"[jekan thangavelautham, filenavigating smallbo...","[arizona, az, arizona, az 85287, arizona, az, ...","[satellitesarticle ieee aerospace, nallapu ari..."
ASTRONOMICAL ENGINEERING,ASTRONOMICAL ENGINEERING: A STRATEGY FOR MODIF...,txt_files/ASTRONOMICAL ENGINEERING.txt,astronomical engineer strategy modify planetar...,"[anson, sridhar tremaine, weissman, gravityass...","[alalthough earth, moon, santa cruz, new york ...",[orbitsdg korycansky codep dept earth science ...
Phase II of the Main Belt Asteroid Spectrosopic Survey,"Icarus 158, 146–177 (2002) doi:10.1006/icar.20...",txt_files/Phase II of the Main Belt Asteroid S...,icarus 158 146177 2002 doi101006icar20026856ph...,"[icarus 158, surveya featurebased taxonomysche...","[ctype, smassi, vesta zone, sclass, sclass, gc...","[science massachusetts institute, cambridge ma..."
Devlopment of Xenon Hall Thrusters,NASA/CR--2004-213099https://ntrs.nasa.gov/sear...,txt_files/Devlopment of Xenon Hall Thrusters.txt,nasacr2004213099httpsntrsnasagovsearchjspr2004...,"[nasasee, lawjune, peerreviewed, nasasee, glen...","[facilities122621, wien, michigan, coilsn^unit...","[xenon hall, hofer university, michigan ann ar..."
Mine planning for Asteroid Ore Bodies,Space Resources Roundtable II (2000)7030.pdfMI...,txt_files/Mine planning for Asteroid Ore Bodie...,space resource roundtable ii 20007030pdfmine p...,"[welldesigned tetherrestraintplatform, oleary ...","[toronto, earth object, earth object]","[cohesivenessasteroid, fragmentationrestraint ..."
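
The three extractors above differ only in the label they keep, and each call re-runs the full spaCy pipeline on the same text. A minimal sketch of an alternative is to collect `(text, label)` pairs once and group them in a single pass. To stay runnable without a spaCy model, the example works on hand-written pairs; `group_entities`, `LABEL_TO_COLUMN`, and the sample data are hypothetical, and in the notebook the pairs would come from `[(ent.text, ent.label_) for ent in nlp(contents).ents]`.

```python
from collections import defaultdict

# Map spaCy entity labels to the column names used in the notebook.
LABEL_TO_COLUMN = {"PERSON": "people", "GPE": "places", "ORG": "institutions"}


def group_entities(pairs):
    # pairs: iterable of (entity_text, entity_label) tuples.
    grouped = defaultdict(list)
    for text, label in pairs:
        column = LABEL_TO_COLUMN.get(label)
        if column is not None:  # ignore labels such as DATE or CARDINAL
            grouped[column].append(text)
    # Always return all three keys, even when a label never occurred.
    return {column: grouped[column] for column in LABEL_TO_COLUMN.values()}


# Hand-written sample pairs, for illustration only.
sample_pairs = [
    ("Ada Lovelace", "PERSON"),
    ("Toronto", "GPE"),
    ("2002", "DATE"),
    ("NASA", "ORG"),
    ("Toronto", "GPE"),
]
result = group_entities(sample_pairs)
print(result)
```

Duplicates are kept, as in the notebook's output; wrapping each list in `set` or `collections.Counter` would be one way to deduplicate or count mentions.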


---