# Text Analytics | BAIS:6100
# Module 2: Python Basics for Text Processing, Part 2

Instructor: Kang-Pyo Lee 

Topics to be covered:
- Modules & packages
- Writing & reading a file
- Handling dataframes
- Element selection from a dataframe
- Iteration over a dataframe
- Deriving a new column from existing columns
- Pattern matching using regular expressions

## Modules & Packages

A module is a file containing Python definitions and statements. The file name is the module name with the file extension .py appended. 
<br><br>
Packages are a way of structuring Python's module namespace by using "dotted module names". For example, the module name `A.B` designates a submodule named `B` in a package named `A`.

In [None]:
import math
math.sqrt(9)

One way to use a module is to import the entire module into the current workspace and access its contents through the module name, as in `math.sqrt`. 

In [None]:
from math import sqrt
sqrt(9)

You can also import a specific name, such as the function `sqrt`, from a module. In this case, you call the name directly without the module prefix.

In [None]:
import numpy as np
import pandas as pd

You can give a local name to a module to be imported. 

External packages such as <i>numpy</i>, <i>pandas</i>, and <i>sklearn</i> must be installed before they can be imported. Installation is done with the `pip install` command at the OS level (for example, in a terminal), not with a Python statement. 
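As a minimal sketch, the three import styles can be compared side by side on a package from the standard library. The choice of `urllib.parse` and the variable names below are illustrative assumptions, not part of the course material.

```python
# Three equivalent ways to reach the urlparse function in the
# submodule `parse` of the package `urllib` (dotted name: urllib.parse).
import urllib.parse                   # import by full dotted name
from urllib.parse import urlparse     # import one specific name
import urllib.parse as up             # import under a local name

url = "https://docs.python.org/3/library/functions.html#open"

host_a = urllib.parse.urlparse(url).netloc   # call through the full dotted name
host_b = urlparse(url).netloc                # call the imported name directly
host_c = up.urlparse(url).netloc             # call through the local name

print(host_a, host_b, host_c)
```

All three calls reach the same function, so the choice is mostly a matter of readability.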

## Writing a File

*** Create a folder named `outcome` in your home folder on IDAS. 

When writing and reading a file, you can use the <b>with</b> keyword to open a file in a specific mode. 

In [None]:
with open("outcome/output.txt", mode="w") as fw:
    fw.write("Hello, world!\n")

open: https://docs.python.org/3/library/functions.html#open

The built-in <b>open</b> function opens the file provided and returns a corresponding file object. If the file cannot be opened, an OSError is raised. The parameter `mode` is an optional string that specifies the mode in which the file is opened. 
- "r": opens a file for reading. (default)
- "w": opens a file for writing. Creates a new file if it does not exist, or truncates the file if it exists.
- "a": opens for appending at the end of the file without truncating it. Creates a new file if it does not exist.
- "b": opens in binary mode.
- "+": opens a file for updating (reading and writing)

The <b>write</b> method writes the given string to the file and returns the number of characters written. 

When the <b>with</b> block ends, the open file is closed automatically, even if an error occurred inside the block. The file object itself still exists, but it can no longer be read from or written to. 
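The following sketch checks what happens to a file object once its <b>with</b> block has ended, and contrasts the "w" and "a" modes. It writes to a temporary folder instead of `outcome` so that it can run anywhere; the path and variable names are assumptions made for illustration.

```python
import os
import tempfile

# A throwaway folder, used here instead of the course's outcome folder.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(path, mode="w") as fw:
    n_written = fw.write("Hello, world!\n")   # write returns the character count
    closed_inside = fw.closed                  # still open inside the block

closed_after = fw.closed                       # closed once the block has ended

# Mode "a" appends to the existing content instead of truncating it.
with open(path, mode="a") as fa:
    fa.write("Second line\n")

with open(path, mode="r") as fr:
    content = fr.read()

print(n_written, closed_inside, closed_after)
print(content, end="")
```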

In [None]:
print("Hello, world!\n", end="")

Note that writing a string to a file using the <b>write</b> method is similar to printing a string on a screen using the <b>print</b> function. They both need a string to be written or printed. 

In [None]:
with open("outcome/output.csv", mode="w") as fw:
    # Write the header row.
    fw.write("num\n")                # Use a new line (\n) between rows.
    
    # Write the value rows.
    for i in range(100):
        fw.write("{}\n".format(i))
        # print("{}\n".format(i), end="")

When writing a CSV file with only one column, you need to decide the delimiter to specify the boundary between separate rows. New line ('\n') is the most common delimiter for rows.

In [None]:
with open("outcome/output2.csv", mode="w") as fw:
    # Write the header row.
    fw.write("num,col1,col2\n")       # Use a comma (,) between columns and a new line (\n) between rows.
    
    # Write the value rows.
    for i in range(100):
        fw.write("{},{},{}\n".format(i, i*10, i*100))
        # print("{},{},{}\n".format(i, i*10, i*100), end="")

When writing a CSV file with multiple columns, you also need to decide the delimiter to specify the boundary between separate columns, e.g., comma (',') or tab ('\t'). Comma is the most common delimiter for columns.

In [None]:
with open("outcome/output3.csv", mode="w") as fw:
    # Write the header row.
    fw.write("num\tcol1\tcol2\n")       # Use a tab (\t) between columns and a new line (\n) between rows.
    
    # Write the value rows.
    for i in range(100):
        fw.write("{}\t{}\t{}\n".format(i, i*10, i*100))

When there is text in the data, however, a comma is not a good choice, as there is a good chance that some of the text contains commas.

When you choose the delimiter symbol for columns, it is important to make sure that none of the column data in the file contains the symbol. Otherwise, when the file is loaded, that symbol will be treated not as a normal character but as a delimiter, resulting in more columns in a row than expected. The same applies to the delimiter symbol for rows: if it appears in the data, the file will have more rows than expected. 
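The delimiter problem can be reproduced in a few lines. The sketch below also shows one possible workaround that these notes do not use: the standard-library `csv` module, which wraps a field in quotes when it contains the delimiter. The sample row is made up for illustration.

```python
import csv
import io

row = ["1", "Hello, world!", "no comma here"]   # the second field contains a comma

# Naive approach: join with commas, then split on commas when loading.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")            # the comma inside the text splits the field

# csv module: fields containing the delimiter are quoted on writing
# and restored on reading.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue()
parsed_fields = next(csv.reader(io.StringIO(quoted_line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive round trip yields four fields instead of three, while the quoted version restores the original row.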

In [None]:
from IPython.display import Image
Image("classdata/images/file.png")

### Generating Random Strings

In [None]:
import string

The <b>string</b> module contains a collection of string constants such as ASCII lowercase/uppercase letters, digits, and special characters.

In [None]:
string.ascii_lowercase            # a string containing all ASCII lowercase letters

In [None]:
string.ascii_uppercase            # a string containing all ASCII uppercase letters

In [None]:
string.ascii_letters              # a string containing all ASCII letters

In [None]:
string.digits                     # a string containing all ASCII decimal digits

In [None]:
string.whitespace                 # a string containing all ASCII whitespace

In [None]:
string.punctuation                # a string containing all ASCII punctuation characters

In [None]:
string.printable                  # a string containing all ASCII characters considered printable

In [None]:
import random

In [None]:
random.choice(string.ascii_letters)     # a random character from ASCII letters 

random: https://docs.python.org/3/library/random.html

The <b>choice</b> function in the <b>random</b> module takes a non-empty sequence as an argument and returns a random element from the sequence.

In [None]:
[random.choice(string.ascii_letters) for i in range(10)] 

Generate a list of 10 random characters from ASCII letters.

In [None]:
"".join([random.choice(string.ascii_letters) for i in range(10)])   # 10 serves as the length of the generated random string.

Generate a string of 10 random characters from ASCII letters.

In [None]:
random.choice(range(10, 20))

Generate a random number between 10 and 19.

In [None]:
"".join([random.choice(string.ascii_letters) for i in range(random.choice(range(10, 20)))])

Generate a string of 10-19 random characters from ASCII letters.

In [None]:
with open("outcome/output4.csv", mode="w") as fw:
    # Write the header row.
    fw.write("num,col1,col2,col3\n")
    
    # Write the value rows.
    for i in range(100):
        num = i
        col1 = "".join([random.choice(string.ascii_letters) for i in range(random.choice(range(10, 20)))])
        col2 = "".join([random.choice(string.ascii_letters) for i in range(random.choice(range(10, 20)))])
        col3 = "".join([random.choice(string.ascii_letters) for i in range(random.choice(range(10, 20)))])
        
        fw.write("{},{},{},{}\n".format(num, col1, col2, col3))

In [None]:
def generate_random_string(seq, min_len, max_len):
    return "".join([random.choice(seq) for i in range(random.choice(range(min_len, max_len+1)))])
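For comparison, here is a sketch of an equivalent helper built on `random.randint` and `random.choices`, both from the standard <b>random</b> module. The name `generate_random_string_v2` is hypothetical, and seeding the generator is an optional addition that makes the output repeatable.

```python
import random
import string

def generate_random_string_v2(seq, min_len, max_len):
    """Return a random string whose length is between min_len and max_len (inclusive)."""
    length = random.randint(min_len, max_len)       # randint includes both end points
    return "".join(random.choices(seq, k=length))   # draw `length` characters with replacement

random.seed(0)                                      # fix the seed for repeatable results
sample = generate_random_string_v2(string.ascii_letters, 10, 19)
print(sample, len(sample))
```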

In [None]:
with open("outcome/output5.csv", mode="w") as fw:
    # Write the header row.
    fw.write("num,col1,col2,col3\n")
    
    # Write the value rows.
    for i in range(100):
        num = i
        col1 = generate_random_string(string.ascii_letters, 10, 19)
        col2 = generate_random_string(string.digits, 10, 19)
        col3 = generate_random_string(string.punctuation, 10, 19)
        
        fw.write("{},{},{},{}\n".format(num, col1, col2, col3))

## Reading a File

In [None]:
with open("outcome/output.txt", mode="r") as fr:
    content = fr.read()
    print(content) 

The <b>read</b> method reads the whole content in the file.

In [None]:
with open("outcome/output4.csv", mode="r") as fr:
    for line in fr:                        # Read the file line by line. 
        print(line, end="")

In [None]:
with open("outcome/output4.csv", mode="r") as fr:
    lines = fr.readlines()                 # Read the entire content in the file as a list of lines

    for line in lines:
        print(line, end="")

The <b>readlines</b> method reads the entire content of the file as a list of lines. This method is not recommended when the file is too large to fit in memory.

In [None]:
with open("outcome/output4.csv", mode="r") as fr:
    lines = fr.readlines()
    
    # Decompose the header row into column names
    header = lines[0]
    header = header.rstrip()               # Remove the trailing new line in the header
    num, col1, col2, col3 = header.split(",")
    
    # Decompose each line into values
    for line in lines[1:]:                 # Start from the second row
        line = line.rstrip()               # Remove the trailing new line in each line
        num_val, val1, val2, val3 = line.split(",")
        print("{}: {}, {}: {}, {}: {}, {}: {}".format(num, num_val, col1, val1, col2, val2, col3, val3), end="\n")
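The same decomposition can be done without loading all lines at once. The sketch below reads the file line by line and pairs each value with its column name in a dictionary; the temporary file and the names `records` and `header` are assumptions made for illustration.

```python
import os
import tempfile

# Create a small CSV file in a throwaway folder.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, mode="w") as fw:
    fw.write("num,col1,col2\n")
    for i in range(3):
        fw.write("{},{},{}\n".format(i, i*10, i*100))

records = []
with open(path, mode="r") as fr:
    header = next(fr).rstrip("\n").split(",")      # the first line holds the column names
    for line in fr:                                # the remaining lines are read one at a time
        values = line.rstrip("\n").split(",")
        records.append(dict(zip(header, values)))  # pair each column name with its value

for record in records:
    print(record)
```

Note that every value is still a string at this point; numbers have to be converted explicitly, e.g., with `int`.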

In [None]:
open("outcome/outputtt.csv", mode="r")
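The call above presumably fails because no file with that misspelled name exists. A minimal sketch of handling this case is shown below; it assumes a path inside an empty temporary folder so that the file is guaranteed to be missing.

```python
import os
import tempfile

# A path that cannot exist: the folder was just created and is empty.
missing_path = os.path.join(tempfile.mkdtemp(), "outputtt.csv")

try:
    with open(missing_path, mode="r") as fr:
        content = fr.read()
except FileNotFoundError as err:                    # a subclass of OSError
    content = None
    error_is_oserror = isinstance(err, OSError)
    print("Could not open the file:", err)
```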

## Exercises - File Writing and Reading

## Data Structures in Pandas

In [None]:
from IPython.display import Image

Image(url="https://cdn-images-1.medium.com/max/800/0*PWbW0OdJJw49kxMt.png")

Series are one-dimensional arrays. A series has an index array, which is called just index.

In [None]:
Image(url="https://cdn-images-1.medium.com/max/800/0*dddYH8GijZanG4dO.png")

A dataframe is designed to extend series to two dimensions. A dataframe has two index arrays: a row index called just index and a column index called columns. A dataframe is, in fact, a collection of multiple series that all share the same index. 

In [None]:
from IPython.display import Image
Image(url="https://i.stack.imgur.com/DL0iQ.jpg")

In Pandas, axis 0 refers to the row axis, while axis 1 to the column axis.
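To make the terms concrete, the sketch below builds a small series and dataframe by hand and sums along each axis. The data and the names `s_small` and `df_small` are made up for illustration.

```python
import pandas as pd

# A series: one-dimensional values plus an index.
s_small = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A dataframe: several series (columns) sharing one index.
df_small = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["a", "b", "c"])

col_sums = df_small.sum(axis=0)   # move along the row axis: one result per column
row_sums = df_small.sum(axis=1)   # move along the column axis: one result per row

print(list(s_small.index), list(df_small.columns))
print(col_sums)
print(row_sums)
```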

## Importing the Pandas Package

In [None]:
# ! pip install --user --upgrade pandas xlsxwriter

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 150)    # set the maximum column width to 150

## Reading a CSV File into a Pandas Dataframe

In [None]:
screen_name = "NASA"

NASA on Twitter: https://twitter.com/NASA

In [None]:
df = pd.read_csv("classdata/tweets/timeline_{}.csv".format(screen_name), sep="\t")

pandas.read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

If you need to read a CSV file and analyze the content in a tabular format with rows and columns, it is a good idea to read the file as a Pandas dataframe. 
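The class data file is not needed to try out <b>read_csv</b>. In the sketch below, a short tab-separated string stands in for a file; the column names and rows are assumptions made for illustration.

```python
import io
import pandas as pd

# Tab-separated content; the commas in the text are harmless because tab is the delimiter.
raw = "status_id\ttext\n1\tHello, world!\n2\tCommas, inside, text\n"

df_demo = pd.read_csv(io.StringIO(raw), sep="\t")   # StringIO makes the string behave like a file

print(df_demo.shape)
print(df_demo)
```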

In [None]:
df

## Handling a Dataframe

In [None]:
df.shape        # (# of rows, # of columns)

In [None]:
df.columns      # the list of column labels

In [None]:
df.index        # the row index (row labels)

In [None]:
list(df.index)

In [None]:
df.values

In [None]:
len(df)

The length of a dataframe is the number of rows in the dataframe. 

In [None]:
df.info()

pandas.DataFrame.info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

The <b>info</b> method shows a concise summary of a dataframe.

In [None]:
df.head()          # Return the first 5 rows

pandas.DataFrame.head: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [None]:
df.head(10)

In [None]:
df.tail()          # Return the last 5 rows

pandas.DataFrame.tail: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html

When you encounter a new dataset, it is a good idea to start by looking at the first and last few rows to get a sense of what the entire dataset looks like.

## Selecting Elements from a Dataframe

In [None]:
df["text"]          # Return all values in the text column

In [None]:
df.text

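`df.a` is equivalent to `df["a"]` if the column label `a` is a string that is a valid Python name and does not clash with an existing dataframe attribute or method.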

In [None]:
type(df.text)

A column in a dataframe is in fact a series.

In [None]:
df[0]                 # Raises a KeyError: 0 is looked up as a column label, and no such column exists

In [None]:
df[:3]                # Return the first 3 rows, the same as df.head(3)

In [None]:
df["text"][0]         # Return the element at row index position 0 and in the text column

Note that you should look up the column label first, followed by the row index number, each in separate matching brackets.

In [None]:
df[0]["text"]         # Raises a KeyError: the row index number cannot come first

In [None]:
df["text"][:3]        # Return the first 3 rows in the text column

In [None]:
df.iloc[0]            # Return the first row

<b>iloc</b> means integer location, i.e., selection by position. If there is only one argument inside the square brackets, it is the row position. 

In [None]:
type(df.iloc[0])

A row in a dataframe is a series too, just as a column in a dataframe is a series.

In [None]:
df.iloc[:, 0]          # Return all rows in the first column

If there are two arguments inside the matching square brackets, the first one is for the row index while the second for the column index. Note that when using <b>iloc</b> you should look up the row index numbers first and then the column index numbers, all in matching square brackets.

In [None]:
df.iloc[:, :2]         # Return all rows in the first 2 columns

In [None]:
df.iloc[:3, :2]        # Return the first 3 rows in the first 2 columns

In [None]:
df.iloc[-3:, -2:]      # Return the last 3 rows in the last 2 columns
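The difference between the two lookup orders can be checked on a toy dataframe. This is a sketch with made-up data; `df_demo` and its columns are hypothetical stand-ins for the tweet dataframe.

```python
import pandas as pd

df_demo = pd.DataFrame({"text": ["a", "b", "c"], "retweet_count": [5, 50, 500]})

first_text = df_demo["text"][0]    # column label first, then row index number
same_text = df_demo.iloc[0, 0]     # iloc: row position first, then column position
first_row = df_demo.iloc[0]        # a single row comes back as a series

try:
    df_demo[0]                     # 0 is looked up as a column label
    raised = False
except KeyError:
    raised = True

print(first_text, same_text, raised)
```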

In [None]:
df[df.retweet_count > 10000]     # Return all rows whose retweet_count value is larger than 10000

You can set a condition as a filter inside the square brackets. 

In [None]:
df[(df.retweet_count > 10000) & (df.is_retweet == 0)]

You can set multiple conditions using Boolean operators.

In [None]:
df[(df.retweet_count > 10000) | (df.is_retweet == 0)]
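A condition such as `df.retweet_count > 10000` is itself a series of Boolean values, often called a mask. The sketch below, on made-up data, stores the masks in variables and combines them with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses when written inline, because these operators bind more tightly than comparisons.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "retweet_count": [5, 20000, 30000, 100],
    "is_retweet": [0, 0, 1, 1],
})

mask_popular = df_demo.retweet_count > 10000     # a Boolean series, one value per row
mask_original = df_demo.is_retweet == 0

both = df_demo[mask_popular & mask_original]     # rows satisfying both conditions
either = df_demo[mask_popular | mask_original]   # rows satisfying at least one condition
not_popular = df_demo[~mask_popular]             # rows where the condition is False

print(list(both.index), list(either.index), list(not_popular.index))
```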

In [None]:
df[["created_at", "text"]]

To select a subset of columns, you can list the columns as a filter inside the square brackets. The outer square brackets are for filtering, while the inner square brackets are for listing of columns. 

In [None]:
df.sample(n=10, replace=False, random_state=0)     # Create a random sample of 10 rows without replacement (no row is picked twice)

pandas.DataFrame.sample: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html

The <b>sample</b> method returns a random sample of items from a dataframe. 

## Exercises - Element Selection from a Dataframe

## Iteration over a Dataframe

In [None]:
df2 = df[:10]
df2

### Iteration of Values in a Column of a Dataframe

In [None]:
for text in df2.text:
    print(text)
    print()

In [None]:
for idx, item in df2.text.items():     # iteritems was removed in pandas 2.0; items works in old and new versions
    print("[{}] {}".format(idx, item))
    print()

pandas.Series.items: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.items.html

### Iteration of Rows in a Dataframe

There are multiple ways to iterate over a dataframe:  
- Using <b>iloc</b>
- Using the <b>iterrows</b> method to iterate over the rows as (index, series) pairs
- Using <b>itertuples</b> method to iterate over the rows as named tuples

You can choose any of the three above depending on how you want to retrieve data from a dataframe.

In [None]:
# Iterates over the rows as (index, series) pairs
for idx, series in df2.iterrows():
    print(idx)
    print(series)
    print()

In [None]:
# Iterates over the rows as (index, series) pairs
for idx, series in df2.iterrows():
    sid = series.status_id
    text = series.text
    print("[{}]\nsid: {}\ntext: {}\n".format(idx, sid, text))

You can decompose each series at each iteration into a set of variables.
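The other two ways of iterating over rows, <b>iloc</b> and <b>itertuples</b>, can be sketched as follows. The toy dataframe is an assumption that mimics the `status_id` and `text` columns used above.

```python
import pandas as pd

df_demo = pd.DataFrame({"status_id": [101, 102], "text": ["first tweet", "second tweet"]})

# Using iloc: loop over row positions and fetch each row as a series.
via_iloc = []
for pos in range(len(df_demo)):
    row = df_demo.iloc[pos]
    via_iloc.append((row.status_id, row.text))

# Using itertuples: each row is a named tuple; the row index is available as row.Index.
via_tuples = []
for row in df_demo.itertuples():
    via_tuples.append((row.Index, row.status_id, row.text))

print(via_iloc)
print(via_tuples)
```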

## Writing a Series/Dataframe to a CSV/Excel File

In [None]:
df.text.to_csv("outcome/timeline_copy.csv", index=False)

In [None]:
df.to_csv("outcome/timeline_copy1.csv", sep="\t", index=False)

pandas.DataFrame.to_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
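One way to check that a file was written as intended is to read it back and compare. The sketch below does this round trip with an in-memory buffer instead of a file in `outcome`; the data is made up for illustration.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({
    "created_at": ["Sun Dec 29", "Mon Dec 30"],
    "text": ["Hello, world!", "No commas"],
})

buffer = io.StringIO()
df_demo.to_csv(buffer, sep="\t", index=False)   # same options as above, written to memory
buffer.seek(0)                                  # rewind before reading
df_back = pd.read_csv(buffer, sep="\t")         # read with the same delimiter

print(df_back)
```

Reading with the same `sep` that was used for writing matters; otherwise each line would be loaded as a single column.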

In [None]:
df[["created_at", "text"]][:100].to_csv("outcome/timeline_copy2.csv", sep="\t", index=False)

In [None]:
df.to_excel("outcome/timeline_copy.xlsx", sheet_name="Sheet1", index=False, engine='xlsxwriter')

pandas.DataFrame.to_excel: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html

## Exercises - Dataframe Iteration & File Writing

## Deriving a New Column from Existing Columns

The raw dataset itself does not always come with all the information you may need. In many cases, you will have to derive new columns from a set of existing columns. 

### Adding a New Column Representing Text Length (text ➔ text_length) 

In [None]:
Image("classdata/images/dataframe.png")

In [None]:
df.text.head()

In [None]:
df["text_length"] = df.text.apply(lambda x: len(x))

pandas.Series.apply: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html

When using the <b>apply</b> method, pay attention to the existing column to be used (e.g., `text`) and the new column to be added (e.g., `text_length`).

In [None]:
df.columns

In [None]:
df[["text", "text_length"]]
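As a side note, the lambda is not strictly required here. The sketch below, on made-up data, passes `len` directly to <b>apply</b> and compares the result with the `str.len` method of a series; both are assumed to give the same lengths when the column contains only strings.

```python
import pandas as pd

df_demo = pd.DataFrame({"text": ["short", "a longer tweet", ""]})

df_demo["text_length"] = df_demo.text.apply(len)      # pass the function itself
df_demo["text_length_vec"] = df_demo.text.str.len()   # string method applied to the whole column

print(df_demo)
```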

### Adding a New Column Representing Whether Text Contains a Link (text ➔ contain_link) 

In [None]:
Image("classdata/images/dataframe2.png")

In [None]:
df.text.head()

In [None]:
df["contain_link"] = df.text.apply(lambda x: True if ("https://" in x) else False)

pandas.Series.apply: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html

When using the <b>apply</b> method, pay attention to the existing column to be used (<i>text</i>) and the new column to be added (<i>contain_link</i>).

In [None]:
df.columns

In [None]:
df[["text", "contain_link"]]

In [None]:
def check_link(s):
    if "https://" in s:
        return True
    else:
        return False

df["contain_link"] = df.text.apply(lambda x: check_link(x))

### Adding a New Column Combining Two Columns (created_at, text ➔ text_detailed) 

In [None]:
Image("classdata/images/dataframe3.png")

In [None]:
df.text.head()

In [None]:
df["text_detailed"] = df.apply(lambda x: "[{}] {}".format(x.created_at, x.text), axis=1)

In [None]:
df.columns

In [None]:
df[["created_at", "text", "text_detailed"]]

In [None]:
def combine_columns(row):
    created_at = row.created_at
    text = row.text
    return "[{}] {}".format(created_at, text)
    
df["text_detailed"] = df.apply(lambda x: combine_columns(x), axis=1)

### Adding Columns Using the str Accessor of Pandas Series

Working with Text Data: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

In [None]:
df.text.str.split()     # Return a series of lists of tokens of the text column

pandas.Series.str.split: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html

In [None]:
df["text_splits"] = df.text.str.split()

In [None]:
df[["text", "text_splits"]]

In [None]:
df.text.str.replace(" ", "_")     # Return a series of strings with whitespaces in the text column replaced with underscores

pandas.Series.str.replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html

In [None]:
df["text_nospace"] = df.text.str.replace(" ", "_")

In [None]:
df[["text", "text_nospace"]]

In [None]:
df.text.str.contains("mission", case=False) # Return a series of Boolean values indicating whether each text contains the word

pandas.Series.str.contains: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

In [None]:
df["mention_mission"] = df.text.str.contains("mission", case=False)

In [None]:
df[["text", "mention_mission"]]

In [None]:
df.text.str.count("#")   # Return a series of counts of the symbol in the text column

pandas.Series.str.count: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html

In [None]:
df["num_hashtags"] = df.text.str.count("#")

In [None]:
df[["text", "num_hashtags"]]

## Exercises - Adding Columns to a Dataframe

## Pattern Matching Using Regular Expressions

In [None]:
s = "From #GlobalGoals to @UNICEF supporters @BTS_twt, don't miss the new GIFs in our special #UNGA story on @GIPHY:… https://t.co/J0iEhGymBD"
s

Suppose we would like to know if there is any URL starting with '*https://*' in the tweet text. 

In [None]:
"https://" in s

We would also like to know how many URLs there are in the text. 

In [None]:
s.count("https://")

Now, what if we would need to find the entire URL in the text? 

In [None]:
s.split()

In [None]:
[item for item in s.split() if item.startswith("https://")]

In [None]:
[item for item in s.split() if item.startswith("@")]

In [None]:
[item for item in s.split() if item.startswith("#")]

It would be great if we can define a pattern to find matches.  

In [None]:
import re

In [None]:
re.findall(pattern=r"https://[A-Za-z0-9\./_]+", string=s)     # The r prefix (raw string) keeps the backslash as is

The <b>findall</b>(pattern, string, flags=0) function in the <b>re</b> module returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.

How to interpret the pattern:
1. starts with *https://*
2. followed by any letter (upper or lower case), digit, dot, slash, or underscore
3. that repeats at least once, but any number of times

In the above pattern, there are meta-characters such as \[, \], -, \\, and +.
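The benefit of a pattern over splitting on whitespace shows up when punctuation follows a token. The sketch below uses a made-up sentence, not a tweet from the dataset, to compare both approaches.

```python
import re

text = "Thanks @UNICEF, @BTS_twt! #UNGA story: https://t.co/J0iEhGymBD"

# Splitting on whitespace keeps trailing punctuation attached to the token.
mentions_split = [item for item in text.split() if item.startswith("@")]

# A character class stops at the first character that is not allowed in it.
mentions_regex = re.findall(r"@[A-Za-z0-9_]+", text)
urls_regex = re.findall(r"https://[A-Za-z0-9\./_]+", text)

print(mentions_split)
print(mentions_regex)
print(urls_regex)
```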

In [None]:
Image("classdata/images/re1.jpg")

In [None]:
Image("classdata/images/re2.jpg")

In [None]:
Image("classdata/images/re3.jpg")

In [None]:
re.findall("@[A-Za-z0-9_]+", s)

In [None]:
re.findall("#[A-Za-z0-9_]+", s)

In [None]:
df["urls"] = df.text.apply(lambda x: re.findall(r"https://[A-Za-z0-9\./_]+", x))
df["user_mentions"] = df.text.apply(lambda x: re.findall(r"@[A-Za-z0-9_]+", x))
df["hashtags"] = df.text.apply(lambda x: re.findall(r"#[A-Za-z0-9_]+", x))

In [None]:
df[["text", "urls", "user_mentions", "hashtags"]]

In [None]:
df.text.str.findall(pat=r"https://[A-Za-z0-9\./_]+")

pandas.Series.str.findall: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.findall.html

In [None]:
df["urls"] = df.text.str.findall(r"https://[A-Za-z0-9\./_]+")
df["user_mentions"] = df.text.str.findall(r"@[A-Za-z0-9_]+")
df["hashtags"] = df.text.str.findall(r"#[A-Za-z0-9_]+")

In [None]:
df["text_nourl"] = df.text.str.replace(r"https://[A-Za-z0-9\./_]+", "", regex=True)     # regex=True is required in pandas 2.0 and later

pandas.Series.str.replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html

In [None]:
df[["text", "text_nourl"]]

### Extracting the Week Day

In [None]:
s = "Sun Dec 29 12:57:36 +0000 2019"

In [None]:
s[:3]

In [None]:
re.findall("^(Sun|Mon|Tue|Wed|Thu|Fri|Sat)", s)

Notice the ^ meta-character, which denotes the start of the string. 

In [None]:
df["created_week_day"] = df.created_at.str.findall("^(Sun|Mon|Tue|Wed|Thu|Fri|Sat)")

In [None]:
df[["created_at", "created_week_day"]]

In [None]:
df["created_week_day"] = [item[0] for item in df.created_at.str.findall("^(Sun|Mon|Tue|Wed|Thu|Fri|Sat)")]

In [None]:
df[["created_at", "created_week_day"]]

### Extracting the Date

In [None]:
s

In [None]:
s[4:10]

In [None]:
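re.findall(r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d\d", s)

In [None]:
re.findall(r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d\d", s)

The `(?:...)` syntax denotes a non-capturing group. With a capturing group `(...)`, <b>findall</b> returns only the text matched by the group; with a non-capturing group, it returns the entire match.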
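The effect of grouping on what <b>findall</b> returns can be seen side by side in the sketch below, which reuses the timestamp string from this section. The variable names are chosen for illustration only.

```python
import re

s = "Sun Dec 29 12:57:36 +0000 2019"
months = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

captured = re.findall(r"(" + months + r") \d\d", s)     # one capturing group: only the group is returned
whole = re.findall(r"(?:" + months + r") \d\d", s)      # non-capturing group: the whole match is returned
pairs = re.findall(r"\d\d:(\d\d):(\d\d)", s)            # two capturing groups: a list of tuples

print(captured)
print(whole)
print(pairs)
```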

In [None]:
df["created_date"] = [item[0] for item in df.created_at.str.findall(r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d\d")]

In [None]:
df[["created_at", "created_date"]]

### Extracting the Hour, Minute, and Second

In [None]:
s

In [None]:
s[11:19]

In [None]:
re.findall(r"\d\d:\d\d:\d\d", s)

In [None]:
re.findall(r"\d\d:(\d\d):\d\d", s)

In [None]:
re.findall(r"\d\d:(\d\d):(\d\d)", s)

### Extracting the Year

In [None]:
s

In [None]:
s[-4:]

In [None]:
re.findall(r"\d\d\d\d$", s)

Notice the $ meta-character, which denotes the end of the string. 

In [None]:
re.findall(r"\d{4}$", s)
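The separate extractions above can also be combined into one pattern. The sketch below uses named groups, written `(?P<name>...)`, so that every part of the timestamp can be retrieved by name. The group names are assumptions, and the pattern is only meant for timestamps in exactly this format.

```python
import re

s = "Sun Dec 29 12:57:36 +0000 2019"

pattern = (
    r"^(?P<weekday>[A-Za-z]{3}) (?P<month>[A-Za-z]{3}) (?P<day>\d{2}) "
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) "
    r"[+-]\d{4} (?P<year>\d{4})$"
)

match = re.search(pattern, s)
parts = match.groupdict()      # a dictionary mapping each group name to its matched text

print(parts)
```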

### Extracting the Word before a Target Word

In [None]:
s = "We’ve got some exciting news to share!"

In [None]:
re.findall("[A-Za-z]+ (?:news|News|NEWS)", s)

In [None]:
re.search("[A-Za-z]+ (?:news|News|NEWS)", s)

In [None]:
e = "[A-Za-z]+ (?:news|News|NEWS)"
[re.findall(e, text) for text in df.text if re.search(e, text)]
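Unlike <b>findall</b>, the <b>search</b> function returns a match object for the first match, or `None` when nothing matches, which is why it works as a filter condition. The sketch below, on made-up sentences, also shows the `re.IGNORECASE` flag as a possible alternative to listing every capitalization.

```python
import re

pattern = r"[A-Za-z]+ (?:news|News|NEWS)"
texts = [
    "We have some exciting news to share!",
    "No updates today.",
    "Breaking NEWS from the launch pad",
]

matches = []
for text in texts:
    m = re.search(pattern, text)                  # a match object, or None if no match
    matches.append(m.group() if m else None)      # group() returns the matched text

# Alternative: one lowercase pattern plus a flag that ignores case.
ignore_case = re.findall(r"[A-Za-z]+ news", "Latest News and old news", flags=re.IGNORECASE)

print(matches)
print(ignore_case)
```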

### Extracting the User Mention after the Word RT

In [None]:
s = "RT @NASAJPL: Today, we're saying goodbye to one of the greats. Data gathering will be completed, and the science lives on."

In [None]:
re.findall("^RT (@[A-Za-z0-9_]+):", s)     # Screen names may contain digits as well

In [None]:
e = "^RT (@[A-Za-z0-9_]+):"
[re.findall(e, text) for text in df.text if re.search(e, text)]

## Exercises - Regular Expressions