# NER with HipHopHeads

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Install spaCy and Download Language Model

We need to install spaCy and its small English-language model (`en_core_web_sm`), which was trained on the annotated "OntoNotes" corpus.

In [2]:
#!pip install -U spacy
!python -m spacy download en_core_web_sm

⚠ Skipping model package dependencies and setting `--no-deps`. You
don't seem to have the spaCy package itself installed (maybe because you've
built from source?), so installing the model dependencies would cause spaCy to
be downloaded, which probably isn't what you want. If the model package has
other dependencies, you'll have to install them manually.
Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047105 sha256=e424438c97353fe4dcf67a70cc18890f3f96a7c44214d2deda3d1cf4277a681d
  Stored in directory: /private/var/folders/t

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. Languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean and Vietnamese don't currently have their own NLP models. However, spaCy offers language and tokenization support for many of these languages through external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

## Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

We're also going to import the `Counter` class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data (we're also raising the pandas default for the maximum number of rows displayed).

In [10]:
import spacy
from spacy import displacy
import en_core_web_sm

from collections import Counter
import pandas as pd
pd.options.display.max_rows = 1000
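
As a quick illustration of how `Counter` tallies items, here is a minimal sketch. The names in `mentions` are made-up placeholders, not values taken from the dataset.

```python
from collections import Counter

# Hypothetical list of names, standing in for entities extracted from posts
mentions = ["Kendrick Lamar", "Kanye West", "Kendrick Lamar", "Drake", "Kendrick Lamar", "Drake"]

# Counter tallies how many times each item appears
mention_counts = Counter(mentions)

# most_common(n) returns the n most frequent items as (item, count) pairs
top_two = mention_counts.most_common(2)
print(top_two)
```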

## Load Language Model

We need to load the language model and save it as the variable `nlp`.

In [5]:
nlp = en_core_web_sm.load()

## Load Data

Read in the CSV file of the top 1000 r/HipHopHeads posts.

In [2]:
hiphop_df = pd.read_csv("top_1000_HipHipHeads_reddit_posts.csv")

## Your turn! 👩‍💻


Convert the "text" column to strings, and make a list of all the posts called `texts`.

In [7]:
#your code here
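
One possible approach, sketched on a tiny stand-in DataFrame. The name `toy_df` and its contents are invented for illustration; only the "text" column name comes from the exercise.

```python
import pandas as pd

# Hypothetical stand-in for hiphop_df; the real data comes from the CSV file
toy_df = pd.DataFrame({"text": ["First post", "Second post", 42]})

# astype(str) converts every value in the column to a string,
# so stray non-text values such as numbers become text too
toy_df["text"] = toy_df["text"].astype(str)

# tolist() turns the column into a plain Python list
toy_texts = toy_df["text"].tolist()
print(toy_texts)
```

The same two steps should carry over to `hiphop_df`, with the resulting list saved as `texts`.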

## Process Document

In the cell below, we run `nlp.pipe()` on our list of posts (`texts`). This processes the posts as a stream and gives us one processed spaCy document per post, which we collect in a list.

In [8]:
chunked_documents = list(nlp.pipe(texts))

## spaCy Named Entities

Below is a Named Entities chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|
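
The descriptions in this chart can also be looked up from code with `spacy.explain()`. A small sketch, where the choice of labels is arbitrary:

```python
import spacy

# spacy.explain() looks up a short description for a label in spaCy's glossary
labels_to_check = ["PERSON", "GPE", "NORP", "FAC"]
descriptions = {label: spacy.explain(label) for label in labels_to_check}

for label, description in descriptions.items():
    print(label, "->", description)
```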


## Get People

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|

## Your turn! 👩‍💻


Alter the code from the previous notebook to create a DataFrame of all the people mentioned in the top 1000 posts in r/HipHopHeads.

In [None]:
# your code here
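
As a hint rather than a definitive solution, here is one way the pieces might fit together. The `sample_entities` pairs are invented; with the real data, they would presumably come from `ent.text` and `ent.label_` for each entity in the processed posts.

```python
from collections import Counter
import pandas as pd

# Hypothetical (text, label) pairs standing in for entities from the processed posts
sample_entities = [
    ("Kendrick Lamar", "PERSON"),
    ("Los Angeles", "GPE"),
    ("Kanye West", "PERSON"),
    ("Kendrick Lamar", "PERSON"),
]

# Keep only the entities labeled PERSON
people = [text for text, label in sample_entities if label == "PERSON"]

# Tally the names, then turn the (name, count) pairs into a DataFrame
people_tally = Counter(people)
people_df = pd.DataFrame(people_tally.most_common(), columns=["person", "count"])
print(people_df)
```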

## Your turn! 👩‍💻


Make a bar plot of the top 10 most mentioned people

In [None]:
# your code here
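
A rough sketch of one way to draw such a plot with pandas, assuming `matplotlib` is installed. The DataFrame `toy_people_df`, its column names, and its counts are all invented for illustration.

```python
import pandas as pd

# Hypothetical counts standing in for the real people DataFrame
toy_people_df = pd.DataFrame({
    "person": ["Kendrick Lamar", "Kanye West", "Drake"],
    "count": [12, 9, 7],
})

# nlargest keeps the rows with the highest counts (10 here, fewer if the data is smaller)
top_people = toy_people_df.nlargest(10, "count")

# DataFrame.plot draws with matplotlib and returns the Axes object
ax = top_people.plot(kind="bar", x="person", y="count", legend=False, title="Most mentioned people")
ax.set_ylabel("Number of mentions")
```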