# Basic Examples for Dealing with Text

In [1]:
import numpy as np
import pandas as pd

## Importing text data

In [2]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [3]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

In [4]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [5]:
len(twenty_train.data)

2257

In [6]:
twenty_train.data[0]

'From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n'

In [7]:
print(twenty_train.data[0])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [8]:
print(twenty_train.target_names[twenty_train.target[0]])

comp.graphics


In [9]:
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2], dtype=int64)

### Import test set

In [10]:
twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data

## Text vectorization example with Bag of words and TF-IDF

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

https://en.wikipedia.org/wiki/Tf-idf
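Before using the scikit-learn classes, it may help to see the two ideas computed by hand. The sketch below is only an illustration on a made-up three-sentence corpus (the sentences, the whitespace tokenizer and all variable names are assumptions, not part of the notebook). It uses the smoothed IDF and L2 normalisation that the scikit-learn documentation describes as the defaults; other TF-IDF variants exist.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenised naively on whitespace.
docs = ["the cat sat", "the cat ran", "dogs bark"]
tokenized = [doc.lower().split() for doc in docs]

# Bag of words: one Counter of token counts per document.
counts = [Counter(tokens) for tokens in tokenized]
vocab = sorted(set(token for tokens in tokenized for token in tokens))

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
df = {term: sum(term in c for c in counts) for term in vocab}

# Smoothed inverse document frequency:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(count):
    """Return the L2-normalised TF-IDF vector of one document."""
    raw = [count[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


tfidf = [tfidf_row(c) for c in counts]

for doc, row in zip(docs, tfidf):
    print(doc, "=>", [round(v, 3) for v in row])
```

In the first sentence, "sat" receives a larger weight than "cat" or "the", because it occurs in only one document while the other two words occur in two.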

### Way 1: Bag of words and TF-IDF separately

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
# Some optional arguments: max_features=1500, min_df=5, max_df=0.7, ngram_range=(1, 2),
# stop_words=nltk.corpus.stopwords.words('english')
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [12]:
X_train_counts

<2257x35788 sparse matrix of type '<class 'numpy.int64'>'
	with 365886 stored elements in Compressed Sparse Row format>

In [13]:
X_train_counts[2,4]

0

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

### Way 2: both at once

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Some optional arguments: max_features=1500, min_df=5, max_df=0.7, ngram_range=(1, 2)
tfidfconverter = TfidfVectorizer()
X = tfidfconverter.fit_transform(twenty_train.data)
X.shape

(2257, 35788)
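The two ways should produce the same matrix, not just the same shape. A quick check on a small made-up corpus (the sentences and variable names below are assumptions for illustration) could look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus standing in for twenty_train.data.
toy_docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "my cat ate the dog food",
]

# Way 1: raw counts first, then TF-IDF weighting.
toy_counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(toy_counts)

# Way 2: both steps in a single estimator.
one_step = TfidfVectorizer().fit_transform(toy_docs)

# Compare the dense versions element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

With default settings on both sides, the comparison is expected to hold; it would stop holding if the two ways were given different arguments.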

### Machine learning

A direct example of classification, using the TF-IDF features from Way 1 in a simple ("dummy") naive Bayes model.

In [16]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [17]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print(doc, "=>", twenty_train.target_names[category])

God is love => soc.religion.christian
OpenGL on the GPU is fast => comp.graphics
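The vectorizer and the classifier can also be chained so that raw text goes in and a prediction comes out, which makes scoring on held-out documents straightforward. The sketch below is one possible arrangement, not the notebook's own method: the helper `fit_and_score` and the toy sentences are hypothetical, and the toy data replaces the newsgroups posts only so that the example runs without a download.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def fit_and_score(train_docs, train_labels, test_docs, test_labels):
    """Fit TF-IDF + naive Bayes on training texts and score on held-out texts."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_docs, train_labels)
    predicted = model.predict(test_docs)
    return model, accuracy_score(test_labels, predicted)


# Hypothetical toy data: label 0 = graphics, label 1 = religion.
toy_train = [
    "the gpu renders graphics fast",
    "opengl shaders and graphics cards",
    "god and faith and church",
    "prayer and church on sunday",
]
toy_train_labels = [0, 0, 1, 1]
toy_test = ["fast graphics gpu", "church prayer"]
toy_test_labels = [0, 1]

model, accuracy = fit_and_score(toy_train, toy_train_labels, toy_test, toy_test_labels)
print("accuracy:", accuracy)
```

The same helper could presumably be called with `twenty_train.data`, `twenty_train.target`, `docs_test` and `twenty_test.target` to measure accuracy on the test set imported earlier.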


### To go further...

Fancier word-embedding methods can be found in the recent literature, and several are available online in trainable or pretrained versions (word2vec, GloVe, fastText, BERT, ...).

## Some useful str methods to wrangle text

In [18]:
import string

In [19]:
text = "Hi there!"
text

'Hi there!'

In [20]:
text.replace("Hi","Hello")

'Hello there!'

In [21]:
text.replace("!"," ! ")

'Hi there ! '

In [22]:
text.replace("e","")

'Hi thr!'

In [23]:
text.split(" ")

['Hi', 'there!']

In [24]:
text.lower()

'hi there!'

In [25]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

The few examples above are only meant to spark ideas. There are many ways to analyze text data and extract informative summaries from it. The pandas `<pd.Series>.apply()` method comes in very handy with custom user-defined functions.
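As a small illustration of `apply()` with user-defined functions, the sketch below combines the `str` methods and `string.punctuation` shown above. The function names `clean_text` and `count_punctuation` and the sample sentences are made up for this example.

```python
import string

import pandas as pd

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lower-case a string, drop punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_STRIP_PUNCT).split())


def count_punctuation(text):
    """Count how many characters of the string are ASCII punctuation."""
    return sum(ch in string.punctuation for ch in text)


messages = pd.Series(["Hi there!", "My dog is cute.", "i lost my wallet"])

cleaned = messages.apply(clean_text)           # one cleaned string per row
n_punct = messages.apply(count_punctuation)    # one integer per row
n_words = cleaned.str.split().str.len()        # words per cleaned string

print(pd.DataFrame({"cleaned": cleaned, "n_punct": n_punct, "n_words": n_words}))
```

Simple summaries like these (word counts, punctuation counts) can serve as extra features alongside the bag-of-words representation.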

A `pd.Series` also exposes vectorized string methods for text data through its `str` accessor. Here are a few toy examples.

In [26]:
str_text = pd.DataFrame({"my_text":["Hi there!","My dog is cute.","i lost my wallet"]})
str_text

            my_text
0         Hi there!
1   My dog is cute.
2  i lost my wallet


In [27]:
str_text.my_text.str.capitalize()

0           Hi there!
1     My dog is cute.
2    I lost my wallet
Name: my_text, dtype: object

In [28]:
str_text.my_text.str.lower()

0           hi there!
1     my dog is cute.
2    i lost my wallet
Name: my_text, dtype: object

In [29]:
str_text.my_text.str.contains("y")

0    False
1     True
2     True
Name: my_text, dtype: bool

In [30]:
str_text.my_text.str.contains("my")

0    False
1    False
2     True
Name: my_text, dtype: bool

In [31]:
str_text.my_text.str.count("e")

0    2
1    1
2    1
Name: my_text, dtype: int64

In [32]:
str_text.my_text.str.replace("e","")

0            Hi thr!
1     My dog is cut.
2    i lost my wallt
Name: my_text, dtype: object

Many more examples can be found in the pandas documentation on working with text data.