# Natural Language Processing - NLP
- text as data
- cleaning
- basic summaries
- modelling text

#### basics of text manipulation
- transforming corpora, documents, sentences, words
    - what part of the data do we want to keep?
    - what part do we **not** want to keep?

In [1]:
# forms of text data:
sentence = 'This is a sentence.'
sentence_2 = ['Is', 'this', 'a', 'sentence', '?']
word = 'word'
document = [
    'This document consists of multiple sentences.',
    'This is one of them.',
    'May this be another sentence?',
    'There are too many sentences in this document!'
]
document_2 = """
    This document consists of multiple sentences.
    This is one of them.
    May this be another sentence?
    There are too many sentences in this document!"""
corpus = [document, sentence, document_2, word]

*It depends on what we want to do with it!*
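Whatever the goal, it usually helps to bring mixed inputs like these into one shape first. A minimal sketch, assuming we want every document as a list of sentence strings: the helper `to_sentences` and its naive split after `.`, `?` or `!` are illustrative assumptions, not a robust sentence splitter, and an already tokenized sentence like `sentence_2` would need its own case.

```python
import re

def to_sentences(item):
    """Return a document as a list of sentence strings (hypothetical helper)."""
    if isinstance(item, list):
        # assume the list already holds one sentence per element
        return [s.strip() for s in item if s.strip()]
    # naive split: break after '.', '?' or '!' when whitespace follows
    parts = re.split(r'(?<=[.?!])\s+', item.strip())
    return [p for p in parts if p]

example_corpus = [
    ['This is one of them.', 'May this be another sentence?'],
    'This is a sentence.',
    """
    This is one of them.
    May this be another sentence?""",
    'word',
]
uniform = [to_sentences(item) for item in example_corpus]
print(uniform)
```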

## The "Rotten Tomatoes" dataset
- find the dataset [here](https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset?resource=download) (requires a Kaggle account)
- save it in the "css_intro/local" folder
- contains review data on a set of movies 

In [2]:
import pandas as pd

In [3]:
reviews = pd.read_csv('../local/rotten_tomatoes_critic_reviews.csv')
movies = pd.read_csv('../local/rotten_tomatoes_movies.csv')

- [ ] look at the dataset contents, what could be interesting?

In [4]:
reviews.head(2)

| | rotten_tomatoes_link | critic_name | top_critic | publisher_name | review_type | review_score | review_date | review_content |
|---|---|---|---|---|---|---|---|---|
| 0 | m/0814255 | Andrew L. Urban | False | Urban Cinefile | Fresh | | 2010-02-06 | A fantasy adventure that fuses Greek mythology... |
| 1 | m/0814255 | Louise Keller | False | Urban Cinefile | Fresh | | 2010-02-06 | Uma Thurman as Medusa, the gorgon with a coiff... |


In [5]:
movies.head(2)

| | rotten_tomatoes_link | movie_title | movie_info | critics_consensus | content_rating | genres | directors | authors | actors | original_release_date | ... | production_company | tomatometer_status | tomatometer_rating | tomatometer_count | audience_status | audience_rating | audience_count | tomatometer_top_critics_count | tomatometer_fresh_critics_count | tomatometer_rotten_critics_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | m/0814255 | Percy Jackson & the Olympians: The Lightning T... | Always trouble-prone, the life of teenager Per... | Though it may seem like just another Harry Pot... | PG | Action & Adventure, Comedy, Drama, Science Fic... | Chris Columbus | Craig Titley, Chris Columbus, Rick Riordan | Logan Lerman, Brandon T. Jackson, Alexandra Da... | 2010-02-12 | ... | 20th Century Fox | Rotten | 49.0 | 149.0 | Spilled | 53.0 | 254421.0 | 43 | 73 | 76 |
| 1 | m/0878835 | Please Give | Kate (Catherine Keener) and her husband Alex (... | Nicole Holofcener's newest might seem slight i... | R | Comedy | Nicole Holofcener | Nicole Holofcener | Catherine Keener, Amanda Peet, Oliver Platt, R... | 2010-04-30 | ... | Sony Pictures Classics | Certified-Fresh | 87.0 | 142.0 | Upright | 64.0 | 11574.0 | 44 | 123 | 19 |


In [6]:
merged_df = pd.merge(movies, reviews, how='left', on='rotten_tomatoes_link')

In [7]:
# keep only rows that actually have review text
merged_short = merged_df[merged_df['review_content'].notnull()]
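To see what the left merge and the null filter do, here is a small sketch on made-up stand-in frames. The toy rows and the names `toy_movies`, `toy_reviews`, `toy_merged` and `toy_short` are assumptions for illustration, not taken from the dataset. A movie without any review survives the left merge with a missing `review_content` and is then dropped by the filter.

```python
import pandas as pd

# made-up stand-ins for the movies and reviews tables
toy_movies = pd.DataFrame({
    'rotten_tomatoes_link': ['m/a', 'm/b'],
    'movie_title': ['Movie A', 'Movie B'],
})
toy_reviews = pd.DataFrame({
    'rotten_tomatoes_link': ['m/a', 'm/a'],
    'review_content': ['Great fun.', None],
})

# left merge: every movie is kept, one row per matching review
toy_merged = pd.merge(toy_movies, toy_reviews, how='left', on='rotten_tomatoes_link')

# drop rows without review text (no review at all, or an empty review)
toy_short = toy_merged[toy_merged['review_content'].notna()]
print(len(toy_merged), len(toy_short))
```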

## Cleaning the text data
- we are not interested in extremely common words like conjunctions, articles, etc.
- we want to treat as the same word
    - different forms of one word, e.g. conjugations ("ran" and "running")
    - words that differ only in capitalization
- we want to remove URLs and maybe hashtags, @-mentions, numbers etc., and almost definitely punctuation
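The first point, dropping very common words, can be sketched with a hand-written stop word list. The tiny `STOP_WORDS` set and the helper `remove_stop_words` below are assumptions for illustration; NLP libraries typically ship much longer lists.

```python
# tiny illustrative stop word list (an assumption, far from complete)
STOP_WORDS = {'a', 'an', 'the', 'and', 'or', 'is', 'of', 'this'}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set, ignoring case."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = 'This is one of the sentences'.split()
print(remove_stop_words(tokens))
```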

In [41]:
testsentence = 'This is a test sentence, containing punctuations (, ; : # @), CAPITALIZATION and conjugated words like running, ran, is, be and also a url - www.thisisaurl.com'

In [42]:
# lowercase all words
lower = testsentence.lower()
print(lower)

this is a test sentence, containing punctuations (, ; : # @), capitalization and conjugated words like running, ran, is, be and also a url - www.thisisaurl.com


In [43]:
# remove urls: first split the text into words
losplit = lower.split(' ')
print(losplit)

['this', 'is', 'a', 'test', 'sentence,', 'containing', 'punctuations', '(,', ';', ':', '#', '@),', 'capitalization', 'and', 'conjugated', 'words', 'like', 'running,', 'ran,', 'is,', 'be', 'and', 'also', 'a', 'url', '-', 'www.thisisaurl.com']


In [44]:
nourl = [word for word in losplit if 'www' not in word]
print(nourl)

['this', 'is', 'a', 'test', 'sentence,', 'containing', 'punctuations', '(,', ';', ':', '#', '@),', 'capitalization', 'and', 'conjugated', 'words', 'like', 'running,', 'ran,', 'is,', 'be', 'and', 'also', 'a', 'url', '-']


In [45]:
nourl = ' '.join(nourl)
print(nourl)

this is a test sentence, containing punctuations (, ; : # @), capitalization and conjugated words like running, ran, is, be and also a url -
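Checking for `'www'` misses links that start with `http://` or `https://`, and it would also drop an ordinary word that happens to contain `www`. A slightly more general sketch uses a regular expression; the pattern and the helper name `strip_urls` are simple assumptions and will not catch every URL form.

```python
import re

# matches tokens starting with http://, https:// or www. up to the next whitespace
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_urls(text):
    """Remove URL-like tokens and collapse the leftover whitespace."""
    no_urls = URL_PATTERN.sub('', text)
    return ' '.join(no_urls.split())

example = 'read more at https://example.org/page or www.thisisaurl.com today'
print(strip_urls(example))
```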


In [56]:
# remove punctuation
import string
nopunct = nourl.translate(str.maketrans('', '', string.punctuation))
print(nopunct)

In [57]:
nopunct = ' '.join([part for part in nopunct.split(' ') if len(part) > 0])
print(nopunct)

this is a test sentence containing punctuations capitalization and conjugated words like running ran is be and also a url
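The steps so far can be collected into one function so they can be applied to every review. A minimal sketch: `clean_text` is a hypothetical helper name and simply chains the lowercasing, the `'www'` filter and the punctuation removal from the cells above.

```python
import string

def clean_text(text):
    """Lowercase, drop 'www' tokens, strip punctuation, collapse spaces."""
    lowered = text.lower()
    # same crude URL filter as above: drop any token containing 'www'
    tokens = [tok for tok in lowered.split() if 'www' not in tok]
    # delete every character listed in string.punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [tok.translate(table) for tok in tokens]
    # tokens that consisted only of punctuation are now empty; skip them
    return ' '.join(tok for tok in stripped if tok)

print(clean_text('Visit www.thisisaurl.com, it is GREAT (really)!'))
```

Applied to the review data, this could look like `merged_short['review_content'].apply(clean_text)`.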


In [58]:
import spacy