### <center> Basic Text Analysis</center>

Source: http://hamelg.blogspot.ca/2015/11/python-for-data-analysis-part-15.html

In [79]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

In [80]:
comments = pd.read_csv('Data/comments.csv')
comments = comments['body']

In [81]:
print(comments.shape)

(4166,)


In [82]:
print(comments.head(8))

0    Strongly encouraging sign for us.  The T-Wolve...
1    [My reaction.](http://4.bp.blogspot.com/-3ySob...
2                     http://imgur.com/gallery/Zch2AWw
3    Wolves have more talent than they ever had rig...
4    Nah. Wigg is on the level of KG but where's ou...
5           2004 was a pretty damn talented team dude.
6                                                  :')
7                                              *swoon*
Name: body, dtype: object


### Pandas String Functions

In [83]:
comments[0].lower() # Convert the first comment to lowercase

"strongly encouraging sign for us.  the t-wolves management better not screw this up and they better surround wiggins with a championship caliber team to support his superstar potential or else i wouldn't want him to sour his prime years here in minnesota just like how i felt with garnett.\n\ntl;dr: wolves better not fuck this up."

In [84]:
comments.str.lower().head(8)  # Convert all comments to lowercase

0    strongly encouraging sign for us.  the t-wolve...
1    [my reaction.](http://4.bp.blogspot.com/-3ysob...
2                     http://imgur.com/gallery/zch2aww
3    wolves have more talent than they ever had rig...
4    nah. wigg is on the level of kg but where's ou...
5           2004 was a pretty damn talented team dude.
6                                                  :')
7                                              *swoon*
Name: body, dtype: object

In [85]:
comments.str.upper().head(8)  # Convert all comments to uppercase

0    STRONGLY ENCOURAGING SIGN FOR US.  THE T-WOLVE...
1    [MY REACTION.](HTTP://4.BP.BLOGSPOT.COM/-3YSOB...
2                     HTTP://IMGUR.COM/GALLERY/ZCH2AWW
3    WOLVES HAVE MORE TALENT THAN THEY EVER HAD RIG...
4    NAH. WIGG IS ON THE LEVEL OF KG BUT WHERE'S OU...
5           2004 WAS A PRETTY DAMN TALENTED TEAM DUDE.
6                                                  :')
7                                              *SWOON*
Name: body, dtype: object

In [86]:
comments.str.len().head(8)

0    329
1    101
2     32
3     53
4    145
5     42
6      3
7      7
Name: body, dtype: int64

In [87]:
comments.str.split(" ").head(8)

0    [Strongly, encouraging, sign, for, us., , The,...
1    [[My, reaction.](http://4.bp.blogspot.com/-3yS...
2                   [http://imgur.com/gallery/Zch2AWw]
3    [Wolves, have, more, talent, than, they, ever,...
4    [Nah., Wigg, is, on, the, level, of, KG, but, ...
5    [2004, was, a, pretty, damn, talented, team, d...
6                                                [:')]
7                                            [*swoon*]
Name: body, dtype: object
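Note the empty string in the first result above: the comment contains a double space, and splitting on a single space produces an empty token between the two spaces. A minimal sketch of the difference, using a made-up two-element series (`toy` is a hypothetical name, not part of the dataset): calling `str.split()` with no argument splits on runs of whitespace instead.

```python
import pandas as pd

# Hypothetical toy data; the first string has a double space like comment 0
toy = pd.Series(["sign for us.  The T-Wolves", "*swoon*"])

split_on_space = toy.str.split(" ")    # splits at every single space
split_on_whitespace = toy.str.split()  # splits on runs of whitespace

print(split_on_space[0])       # contains an empty string between the two spaces
print(split_on_whitespace[0])  # no empty string
```

Which one you want depends on whether empty tokens carry meaning for the analysis; for word counts, splitting on whitespace is usually the safer choice.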

In [88]:
comments.str.strip("[]").head(8)  # Strip leading and trailing square brackets

0    Strongly encouraging sign for us.  The T-Wolve...
1    My reaction.](http://4.bp.blogspot.com/-3ySobv...
2                     http://imgur.com/gallery/Zch2AWw
3    Wolves have more talent than they ever had rig...
4    Nah. Wigg is on the level of KG but where's ou...
5           2004 was a pretty damn talented team dude.
6                                                  :')
7                                              *swoon*
Name: body, dtype: object

In [89]:
comments.str.cat()[0:500] # Check the first 500 characters

"Strongly encouraging sign for us.  The T-Wolves management better not screw this up and they better surround Wiggins with a championship caliber team to support his superstar potential or else I wouldn't want him to sour his prime years here in Minnesota just like how I felt with Garnett.\n\nTL;DR: Wolves better not fuck this up.[My reaction.](http://4.bp.blogspot.com/-3ySobv38ihc/U6yxpPwsbzI/AAAAAAAAIPo/IO8Z_wbTIVQ/s1600/2.gif)http://imgur.com/gallery/Zch2AWwWolves have more talent than they ever"

In [90]:
comments.str.slice(0,10).head(8) # Slice the first 10 characters

0    Strongly e
1    [My reacti
2    http://img
3    Wolves hav
4    Nah. Wigg 
5    2004 was a
6           :')
7       *swoon*
Name: body, dtype: object

In [91]:
comments.str[0:10].head(8) # Slice the first 10 characters

0    Strongly e
1    [My reacti
2    http://img
3    Wolves hav
4    Nah. Wigg 
5    2004 was a
6           :')
7       *swoon*
Name: body, dtype: object

In [92]:
comments.str.slice_replace(5, 10, 'Wolves Rule! ').head()

0    StronWolves Rule! ncouraging sign for us.  The...
1    [My rWolves Rule! on.](http://4.bp.blogspot.co...
2             http:Wolves Rule! ur.com/gallery/Zch2AWw
3    WolveWolves Rule! e more talent than they ever...
4    Nah. Wolves Rule! is on the level of KG but wh...
Name: body, dtype: object

In [93]:
comments.str.replace('Wolves', 'Pups').head(8)

0    Strongly encouraging sign for us.  The T-Pups ...
1    [My reaction.](http://4.bp.blogspot.com/-3ySob...
2                     http://imgur.com/gallery/Zch2AWw
3    Pups have more talent than they ever had right...
4    Nah. Wigg is on the level of KG but where's ou...
5           2004 was a pretty damn talented team dude.
6                                                  :')
7                                              *swoon*
Name: body, dtype: object
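`str.replace()` can also take a regular expression (covered in the next section). The sketch below assumes a small made-up series (`toy` is hypothetical) and passes `regex=True` explicitly, since the default value of that parameter has changed between pandas versions.

```python
import pandas as pd

# Hypothetical toy data with mixed capitalization
toy = pd.Series(["Wolves win", "go wolves", "T-Wolves"])

# regex=True is passed explicitly so the pattern is treated as a regular
# expression regardless of the installed pandas version's default
replaced = toy.str.replace(r"[Ww]olves", "Pups", regex=True)

print(replaced.tolist())
```

Being explicit about `regex` also guards against the opposite surprise: a literal string such as "." being interpreted as a pattern.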

In [94]:
logical_index = comments.str.lower().str.contains('wigg|drew')

comments[logical_index].head(10)  # Get the first 10 comments that mention Wiggins

0     Strongly encouraging sign for us.  The T-Wolve...
4     Nah. Wigg is on the level of KG but where's ou...
9                            I FUCKING LOVE YOU ANDREW 
10                                   I LOVE YOU WIGGINS
33    Yupiii!!!!!! Great Wiggins celebration!!!!! =D...
44                         Wiggins on the level of KG?!
45    I'm comfortable with saying that Wiggins is as...
62       They seem so Wiggins. Did he help design them?
63    The more I think about this the more I can und...
64    I dig these a lot. Like the AW logo too with t...
Name: body, dtype: object
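As an alternative to lowercasing first, `str.contains()` accepts `case=False`, and `na=False` treats missing comments as non-matches so the result can be used directly as a boolean index. This is a sketch on made-up data (`toy` and `mask` are hypothetical names), not a rerun of the cell above.

```python
import pandas as pd

# Hypothetical toy data, including a missing value
toy = pd.Series(["I LOVE YOU WIGGINS", "Andrew is great", "nice game", None])

# case=False ignores capitalization; na=False marks missing values as False
mask = toy.str.contains("wigg|drew", case=False, na=False)

print(mask.tolist())
print(toy[mask].tolist())
```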

In [95]:
# Calculate the proportion of comments that mention Andrew Wiggins:
len(comments[logical_index])/len(comments)

0.06649063850216035
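The same proportion can be computed as the mean of the boolean index, since `True` counts as 1 and `False` as 0. A minimal sketch on a made-up series (`toy` and `mask` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data: two of the four strings mention Wiggins
toy = pd.Series(["Wiggins!", "nice game", "go team", "Andrew"])
mask = toy.str.lower().str.contains("wigg|drew")

ratio_len = len(toy[mask]) / len(toy)  # the approach used above
ratio_mean = mask.mean()               # mean of a boolean Series

print(ratio_len, ratio_mean)
```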

### Regular Expressions


"." - The period is a metacharacter that matches any character other than a newline:

In [96]:
my_series = pd.Series(['will','bill','Till','still','gull'])

my_series.str.contains('.ill')  # Match any character followed by "ill"

0     True
1     True
2     True
3     True
4    False
dtype: bool


"[ ]" - Square brackets specify a set of characters to match:

In [97]:
my_series.str.contains("[Tt]ill") # Matches T or t followed by "ill"

0    False
1    False
2     True
3     True
4    False
dtype: bool

Regular expressions include several special character sets that let you quickly specify common character types. They include:<br><br>
[a-z] - match any lowercase letter <br>
[A-Z] - match any uppercase letter <br>
[0-9] - match any digit <br>
[a-zA-Z0-9] - match any letter or digit<br><br>
Adding the "^" symbol inside the square brackets matches any character NOT in the set:<br><br>
[^a-z] - match any character that is not a lowercase letter <br>
[^A-Z] - match any character that is not an uppercase letter <br>
[^0-9] - match any character that is not a digit <br>
[^a-zA-Z0-9] - match any character that is not a letter or digit<br><br>
Python regular expressions also include shorthand for common character classes:<br><br>
\d - match any digit <br>
\D - match any non-digit <br>
\w - match a word character (a letter, digit, or underscore)<br>
\W - match a non-word character <br>
\s - match whitespace (spaces, tabs, newlines, etc.) <br>
\S - match non-whitespace<br><br>
"^" - outside of square brackets, the caret symbol searches for matches at the beginning of a string:<br>

In [98]:
ex_strl = pd.Series(['Where did he go', 'He went to the mall', 'he is good'])

ex_strl.str.contains('^(He|he)') # Matches He or he at the start of a string

0    False
1     True
2     True
dtype: bool

"$" - searches for matches at the end of a string:

In [99]:
ex_strl.str.contains('(go)$') # Matches go at the end of a string

0     True
1    False
2    False
dtype: bool
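The character sets and shorthand classes listed earlier can be tried the same way. The sketch below uses a made-up series (`toy` is a hypothetical name) and raw strings for the patterns.

```python
import pandas as pd

# Hypothetical toy data covering letters, digits and whitespace
toy = pd.Series(["abc", "ABC", "123", "a1 b2", "hello world"])

has_digit = toy.str.contains(r"\d")                # any digit
has_upper = toy.str.contains(r"[A-Z]")             # any uppercase letter
has_space = toy.str.contains(r"\s")                # any whitespace
has_non_alnum = toy.str.contains(r"[^a-zA-Z0-9]")  # anything not a letter or digit

print(has_digit.tolist())
print(has_upper.tolist())
print(has_space.tolist())
print(has_non_alnum.tolist())
```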

"( )" - parentheses in regular expressions are used for grouping and to enforce the proper order of operations, just as they are in math and logical expressions. In the "^" and "$" examples above, the parentheses group the alternatives so that the "^" and "$" symbols apply to the entire "or" expression.<br><br>
"*" - an asterisk matches zero or more copies of the preceding character<br><br>
"?" - a question mark matches zero or one copy of the preceding character<br><br>
"+" - a plus matches one or more copies of the preceding character<br><br>

In [103]:
ex_str2 = pd.Series(["abdominal","b","aa","abbcc","aba"])

# Match 0 or more a's, a single b, then 1 or more characters
ex_str2.str.contains('a*b.+')

0     True
1    False
2    False
3     True
4     True
dtype: bool

In [104]:
# Match 1 or more a's, an optional b, then 1 or more a's
ex_str2.str.contains('a+b?a+')

0    False
1    False
2     True
3    False
4     True
dtype: bool

"{ }" - curly braces match a preceding character for a specified number of repetitions:<br><br>
"{m}" - the preceding element is matched m times<br><br>
"{m,}" - the preceding element is matched m times or more<br><br>
"{m,n}" - the preceding element is matched between m and n times<br><br>

In [105]:
ex_str3 = pd.Series(["aabcbcb","abbb","abbaab","aabb"])

ex_str3.str.contains("a{2}b{2,}")   # Match 2 a's then 2 or more b's

0    False
1    False
2    False
3     True
dtype: bool


"\\" - a backslash lets you "escape" metacharacters. You must escape a metacharacter when you want to match the symbol itself. For instance, to match periods you can't use "." because it is a metacharacter that matches any character. Instead, you'd use `\.` to escape the period's metacharacter behavior and match the period itself. The backslash is a metacharacter too, so matching a literal backslash requires the pattern `\\`:

In [107]:
ex_str4 = pd.Series(["Mr. Ed","Dr. Mario","Miss\\Mrs Granger."])

ex_str4.str.contains(r"\\") # Match strings containing a backslash

0    False
1    False
2     True
dtype: bool
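The period case described above works the same way. A minimal sketch on a made-up series (`toy` is a hypothetical name) compares the unescaped pattern, the escaped pattern, and the `regex=False` option of `str.contains()`, which treats the pattern as a literal string.

```python
import pandas as pd

# Hypothetical toy data: only the first two strings contain a period
toy = pd.Series(["Mr. Ed", "Dr. Mario", "Miss Granger"])

unescaped = toy.str.contains(".")             # "." matches any character
escaped = toy.str.contains(r"\.")             # "\." matches a literal period
literal = toy.str.contains(".", regex=False)  # no regex interpretation at all

print(unescaped.tolist())
print(escaped.tolist())
print(literal.tolist())
```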


Raw strings are often used for regular expression patterns because they avoid issues that may arise when dealing with special string characters such as backslashes.<br><br>
There are more regular expression intricacies we won't cover here, but combinations of the few symbols we've covered give you a great amount of expressive power. Regular expressions are commonly used to perform tasks like matching phone numbers, email addresses and web addresses in blocks of text.<br><br>
To use regular expressions outside of pandas, you can import the regular expression library with: import re.<br><br>
Pandas has several string functions that accept regex patterns and perform an operation on each string in a series. We already saw two such functions: series.str.contains() and series.str.replace(). Let's go back to our basketball comments and explore some of these functions.<br><br>
Use series.str.count() to count the occurrences of a pattern in each string:

In [110]:
comments.str.count(r'[Ww]olve').head(8)

0    2
1    0
2    0
3    1
4    0
5    0
6    0
7    0
Name: body, dtype: int64

Use series.str.findall() to get each matched substring and return the result as a list:

In [111]:
comments.str.findall(r"[Ww]olves").head(8)

0    [Wolves, Wolves]
1                  []
2                  []
3            [Wolves]
4                  []
5                  []
6                  []
7                  []
Name: body, dtype: object
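As mentioned earlier, the same patterns work outside of pandas through Python's built-in `re` module. The sketch below applies the `[Ww]olves` pattern to a single made-up string (`text` is a hypothetical example, not taken from the dataset).

```python
import re

# Hypothetical example string
text = "Wolves have more talent than they ever had. Go wolves!"

first = re.search(r"[Ww]olves", text)         # first match object, or None
all_matches = re.findall(r"[Ww]olves", text)  # list of every matched substring
replaced = re.sub(r"[Ww]olves", "Pups", text) # replace every match

print(first.group(), first.start())
print(all_matches)
print(replaced)
```

`re.search()`, `re.findall()` and `re.sub()` roughly correspond to `str.contains()`, `str.findall()` and `str.replace()` on a series, applied to one string at a time.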

### Getting Posts with Web Links

In [113]:
web_links = comments.str.contains(r'https?:')

posts_with_links = comments[web_links]

print(len(posts_with_links))

posts_with_links.head(5)

216


1     [My reaction.](http://4.bp.blogspot.com/-3ySob...
2                      http://imgur.com/gallery/Zch2AWw
25    [January 4th, 2005 - 47 Pts, 17 Rebs](https://...
29    [You're right.](http://espn.go.com/nba/noteboo...
34    https://www.youtube.com/watch?v=K1VtZht_8t4\n\...
Name: body, dtype: object

In [115]:
only_links = posts_with_links.str.findall(r"https?:[^ \n\)]+")

only_links.head(10)

1     [http://4.bp.blogspot.com/-3ySobv38ihc/U6yxpPw...
2                    [http://imgur.com/gallery/Zch2AWw]
25    [https://www.youtube.com/watch?v=iLRsJ9gcW0Y, ...
29    [http://espn.go.com/nba/notebook/_/page/ROY141...
34        [https://www.youtube.com/watch?v=K1VtZht_8t4]
40        [https://www.youtube.com/watch?v=mFEzW1Z6TRM]
69                [https://instagram.com/p/2HWfB3o8rK/]
76    [https://www.youtube.com/watch?v=524h48CWlMc&a...
93                     [http://i.imgur.com/OrjShZv.jpg]
95    [http://content.sportslogos.net/logos/6/232/fu...
Name: body, dtype: object
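Because `findall()` returns a list per post, one possible next step is to flatten those lists and look at which sites are linked most often. This is only a sketch on made-up posts (`toy_posts` and the URL paths are hypothetical); it uses `explode()` to get one link per row and `str.extract()` with a capture group to pull out the domain.

```python
import pandas as pd

# Hypothetical posts; the URL paths are invented for illustration
toy_posts = pd.Series([
    "[clip](https://www.youtube.com/watch?v=abc) and http://imgur.com/gallery/xyz",
    "no links here",
    "https://www.youtube.com/watch?v=def\n\nnice",
])

# Same pattern as above: a link runs until a space, newline or ")"
links = toy_posts.str.findall(r"https?:[^ \n\)]+")

# One row per link; posts without links become NaN and are dropped
flat = links.explode().dropna()

# Capture everything between "://" and the next "/"
domains = flat.str.extract(r"https?://([^/]+)", expand=False)

print(domains.value_counts())
```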