# Python NumPy Recap & Text Analysis

### 1.  Dynamic Typing Example
Python is a dynamically typed language.
The Python interpreter does type checking only as code runs.
The type of a variable is allowed to change over its lifetime.
https://docs.python.org/3/tutorial/

In [None]:
x= 10
type(x)

In [None]:
x = 'MyName'
type(x)

In [None]:
x= True
type(x)

In [None]:
x= 20.353
type(x)

### 2. Nested For loop in list comprehension

In [None]:
#simple for loop example
square = []
for i in range(1,21):
    square.append(i**2)
print(square)


In [None]:
# simple list comprehension for the same loop example
[i**2 for i in range(1,21)]

In [None]:
# A list comprehension can produce tuples from nested loops.
# Structure: [(expr1(item1), expr2(item2))
#                for item1 in iter1
#                    for item2 in iter2]
[(row,col) for row in range(6) for col in range(5)]    
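
The order of the `for` clauses in a comprehension mirrors the order of the equivalent nested loops. The sketch below builds the same list both ways; the names `pairs_loop` and `pairs_comp` are chosen here for illustration only.

```python
# Nested for loops that build the same list as the comprehension
pairs_loop = []
for row in range(6):        # outer loop: written first in the comprehension
    for col in range(5):    # inner loop: written second
        pairs_loop.append((row, col))

pairs_comp = [(row, col) for row in range(6) for col in range(5)]

print(pairs_loop == pairs_comp)  # True
print(len(pairs_comp))           # 30
```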

### 3. Formatting
Formatted string literals (also called f-strings for short) let you include the value of Python expressions inside a string by prefixing the string with f or F and writing expressions as {expression}.


In [None]:
# An optional format specifier after the expression gives greater control
# over how the value is formatted.
# The following example rounds pi to three places after the decimal:
import math
print(f'The value of pi is approximately {math.pi:.3f}.')

In [None]:
from datetime import date
year = 2021
place = 'Washington University'
time = date.today()
f'Welcome to {place} in this {year} on {time}!'

In [None]:
table = {'Anna': 123, 'Jack': 456, 'Dan': 789}
for name, phone in table.items():   
    print(f'{name:10} <====> {phone:10d}')

### 4. Functions 

In [None]:
#functions
def f(x, y):
    return 10 * x + y

In [None]:
f(10,20)

### 5. NumPy
Advantages of NumPy
1. Much faster than core Python for array operations (heavy use of C extensions)
2. Advanced libraries (scikit-learn, SciPy, Keras, etc.) make extensive use of NumPy

In [None]:
# importing the numpy package
import numpy as np

In [None]:
a = np.arange(15)
print("array a is:\n", a)
print("Shape of the array a\n", a.shape)
print("Dimension of the array a\n",a.ndim)
print("Size of the array a\n",a.size)
print("Data type of the array a\n",type(a))
print("Data Type Name of the array a\n",a.dtype.name)


In [None]:
a = np.arange(15).reshape(3, 5)
print("array a is:\n", a)
print("Shape of the array a\n", a.shape)
print("Dimension of the array a\n",a.ndim)
print("Size of the array a\n",a.size)
print("Data type of the array a\n",type(a))
print("Data Type Name of the array a\n",a.dtype.name)


In [None]:
# np.zeros creates an array full of zeros and np.ones creates an array full of ones.
# By default, the dtype of the created array is float64,
# but it can be specified via the keyword argument dtype.

# specify a complex data type
C = np.array([[1, 2], [3, 4]], dtype=complex)
print("Complex data \n", C)
C1 = np.arange(6) # 1d array
print("1d array\n", C1)
C2 = np.arange(12).reshape(4, 3) # 2d array
print("2d array\n", C2)
C3 = np.arange(24).reshape(2, 3, 4)  # 3d array
print("3d array\n", C3)
D = np.zeros((3, 4))
print("array with zeros \n",D)
# specify the data type
E = np.ones((2, 3, 4), dtype=np.int16)
print("array with ones \n",E)

In [None]:
# Operations such as += and *= act in place: they modify an existing array rather than creating a new one.
# Commonly used: shape, min(), max(), sum(), slicing, indexing
print(C3.shape, C3.min(), C3.max(), C3.sum())
# Iterating over a multidimensional array is done with respect to the first axis
for row in C3:
    print(row)
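
The comments above mention in-place operators, indexing, and slicing without showing them. A minimal sketch follows; the arrays `a`, `alias`, and `b` are made up for illustration, and `C3` is recreated so the block runs on its own.

```python
import numpy as np

a = np.ones((2, 3), dtype=np.int64)
alias = a            # second name for the same array object

a *= 3               # in place: the existing array is modified
print(alias)         # alias sees the change, every element is now 3

b = a * 2            # creates a new array; a is unchanged
print(b is a)        # False

# Indexing and slicing on a 3d array
C3 = np.arange(24).reshape(2, 3, 4)
print(C3[1, 2, 3])   # single element: last block, last row, last column
print(C3[0, :, 1])   # column 1 of the first 3x4 block
```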

In [None]:
print(C3.ravel())  # returns the array, flattened
# the flat attribute is an iterator over all the elements of the array
for element in C3.flat:
    print(element)

In [None]:
A = np.array([[1, 1],
              [0, 1]])
B = np.array([[2, 0],
              [3, 4]])
#Basic operations
print(A - B)
print(A*2)
print(B**2)
print(A * B)     # elementwise product
print(A @ B)     # matrix product
print(A.dot(B))   # matrix product again, via the dot method
print(np.exp(B))  # elementwise exponential, e**x
print(np.sqrt(B)) # elementwise square root

In [None]:
# Shape manipulation: changing the shape of an array
# Create an instance of the default random number generator (seed 1)
rg = np.random.default_rng(1)
a = np.floor(10 * rg.random((3, 4)))
print(a)
print(a.T)              # returns the transpose
print(a.reshape(6, 2))  # returns the array with a modified shape


In [None]:
#Stacking together different arrays
a = np.floor(10 * rg.random((2, 2)))
b = np.floor(10 * rg.random((2, 2)))
print(a)
print(b)
print(np.vstack((a, b)))
print(np.hstack((a, b)))


#### List vs Array Performance

In [None]:
import random
%timeit rolls_list = [random.randrange(1,7) for i in range(0,6_000_000)]

In [None]:
%timeit rolls_array = np.random.randint(1,7,6_000_000)
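
`%timeit` is an IPython magic, so the two cells above only run inside a notebook or IPython shell. As a rough sketch, the same comparison can be made in a plain script with the standard-library `timeit` module. The helper names and the smaller sample size `n` are assumptions made here to keep the run short, and the timings will vary by machine.

```python
import random
import timeit

import numpy as np

n = 100_000  # smaller than the notebook's 6,000,000 so the sketch runs quickly

def rolls_with_list():
    # pure-Python list comprehension, one randrange call per roll
    return [random.randrange(1, 7) for _ in range(n)]

def rolls_with_array():
    # one vectorized NumPy call; the upper bound 7 is exclusive
    return np.random.randint(1, 7, n)

t_list = timeit.timeit(rolls_with_list, number=3)
t_array = timeit.timeit(rolls_with_array, number=3)
print(f"list:  {t_list:.4f} s for 3 runs")
print(f"array: {t_array:.4f} s for 3 runs")
```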

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html 

## Why Natural Language Processing (NLP)?
### Subfield of Data Science
### Interactive Interface between Human and Machine
### Goal: Computers “understand” natural language and perform tasks (appointments, purchases, Q&A, etc.)

![image.png](attachment:image.png)

### Major Data Sources for analytics: Structured (20%) vs. Unstructured (80%)
### Unstructured: video, text, images, audio, emails, social media, websites, reports, feedback systems, clinical notes, documents, PPT files, surveys, IoT data, etc. (Gartner, IDC, surveys)

![Screen%20Shot%202021-09-23%20at%2011.22.14%20AM.png](attachment:Screen%20Shot%202021-09-23%20at%2011.22.14%20AM.png)

### 6. Text Analysis
### Understanding word representation, processing and analysis...
#### Fundamental Techniques: Text Pre-Processing
##### Tokenization, stemming (cutting words down to a root form), lemmatization (similar, but based on morphological analysis of the words), POS tagging (part-of-speech tags: nouns, verbs, adjectives, adverbs, etc.), chunking, named entity recognition (e.g., person names, organizations, locations, times, dates, etc.)

In [None]:
#!pip install nltk==3.5
#!pip install numpy matplotlib

https://www.nltk.org/ 

### import the relevant parts of NLTK so you can tokenize by word and by sentence

In [None]:
import nltk
nltk.download('punkt')

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
sample_text = """Same great ice cream flavor and friendly service 
as in the S 18th street location. 
This location is not as small but it's hard to talk to friends. 
Thankfully there is great outdoor seating to escape the noise.."""

In [None]:
sent_tokenize(sample_text)

In [None]:
tokenized_word = word_tokenize(sample_text)
print(tokenized_word)

## Frequency Distribution

In [None]:
from nltk.probability import FreqDist
tokenized_word = word_tokenize(sample_text)
fdist = FreqDist(tokenized_word)
print(fdist)

In [None]:
fdist.most_common(10)

In [None]:
# Frequency Distribution Plot
import matplotlib.pyplot as plt
fdist.plot(30,cumulative=False)
plt.show()

## Stop word Removal
Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [None]:
nltk.download("stopwords")
from nltk.corpus import stopwords

In [None]:
stop_words = set(stopwords.words("english"))
print(stop_words)

In [None]:
filtered =[]
for w in tokenized_word:
    if w not in stop_words:
        filtered.append(w)
print("Tokenized words:",tokenized_word)
print("\nFiltered words:",filtered)
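
One thing to watch for: NLTK's English stop-word list is all lowercase, so a capitalized token such as 'This' passes through a case-sensitive filter. The sketch below illustrates the difference using a small hand-picked stop-word set and token list (both hypothetical, so that it runs without any NLTK downloads).

```python
# Hypothetical, hand-picked subset of stop words (lowercase, like NLTK's list)
stop_words = {"this", "is", "not", "as", "but", "to", "it"}
tokens = ["This", "location", "is", "not", "as", "small", "but", "hard", "to", "talk"]

# Case-sensitive filter: 'This' survives because it is capitalized
case_sensitive = [w for w in tokens if w not in stop_words]

# Case-insensitive filter: compare the lowercased token instead
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)    # ['This', 'location', 'small', 'hard', 'talk']
print(case_insensitive)  # ['location', 'small', 'hard', 'talk']
```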

## Stemming
Stemming is a process of linguistic normalization that reduces words to their root form by chopping off derivational affixes.

In [None]:
# Stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed_words=[]
for w in filtered:
    stemmed_words.append(ps.stem(w))

print("Filtered words:",filtered)
print("Stemmed words:",stemmed_words)

## Lemmatization
Lemmatization reduces words to their base form, a linguistically correct lemma, using a vocabulary and morphological analysis. It is usually more sophisticated than stemming: a stemmer works on an individual word without knowledge of the context.

In [None]:
#Lexicon Normalization using Lemmatization

from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
lem = WordNetLemmatizer()

lemmatized_words=[]
for w in filtered:
    lemmatized_words.append(lem.lemmatize(w))

print("Filtered words:",filtered)
print("\nStemmed words:",stemmed_words)
print("\nLemmatized words:",lemmatized_words)


## POS tagging
The primary goal of part-of-speech (POS) tagging is to identify the grammatical category of a given word (noun, pronoun, adjective, verb, adverb, etc.) based on its context. POS tagging looks at relationships within the sentence and assigns a corresponding tag to each word.
https://www.guru99.com/pos-tagging-chunking-nltk.html

In [None]:
nltk.download('averaged_perceptron_tagger')

In [None]:

print("original text with everything:", nltk.pos_tag(tokenized_word))
print("\ncleaner text after stop word removal", nltk.pos_tag(filtered))
print("\ncleaner text after stemming", nltk.pos_tag(stemmed_words))
print("\ncleaner text after lemmatization", nltk.pos_tag(lemmatized_words))


## Chunking
While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.

A chunk grammar is a combination of rules on how sentences should be chunked. It often uses regular expressions, or regexes.
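
As a minimal sketch of a chunk grammar, the example below uses `nltk.RegexpParser` with a single noun-phrase rule. The tagged sentence is hand-written for illustration (so no tagger model is needed), and the rule itself is just one plausible choice, not the only way to define a noun phrase.

```python
import nltk

# Hand-tagged (word, tag) pairs, so no tagger model download is needed
tagged = [("the", "DT"), ("great", "JJ"), ("outdoor", "JJ"),
          ("seating", "NN"), ("is", "VBZ"), ("nice", "JJ")]

# Chunk rule: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tagged)
print(tree)

# Collect the words of every noun-phrase chunk
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)  # ['the great outdoor seating']
```

In the notebook, the output of nltk.pos_tag(tokenized_word) could be passed to the parser in place of the hand-tagged list.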

## More on String Parsing and Transformation

## Translate and Replace

Python's built-in str type has methods that perform these tasks.

str.maketrans() creates a mapping table.

translate() uses that table to delete or change specific characters.

translate() performs a character-by-character substitution in a string.


In [None]:
sample_text = """Some not so great ice cream flavor and mediocre 
service as in the S 18th street location. 
This location is not as small but it's hard to talk to friends. 
Thankfully, my favorite ice cream store is just next door ;-)"""

# map each of the characters ';', '-' and ')' to a space
translation_emoji = str.maketrans(';-)', '   ')
print(sample_text)

sample_text.translate(translation_emoji)

In [None]:
# str.replace replaces every occurrence of a substring with another string.
sample_text.replace(';-)', '')
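
The two approaches differ in a way that is easy to miss: translate works on single characters, while replace matches a whole substring. The sketch below uses a made-up sentence containing a hyphenated word to show the effect. It also uses the three-argument form of str.maketrans, where the third argument lists characters to delete.

```python
text = "My well-known favorite store is next door ;-)"

# translate maps single characters: every ';', '-' and ')' is deleted,
# including the hyphen inside "well-known"
delete_table = str.maketrans("", "", ";-)")
translated = text.translate(delete_table)
print(translated)

# replace matches the whole substring ';-)' and leaves other hyphens alone
replaced = text.replace(";-)", "").rstrip()
print(replaced)
```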

## String sanitizing

In [None]:
test_string_with_garbage = """Some not so great ice cream flavor and mediocre
\nservice as in the S 18th street location.
\nThis location is not as small but it's hard to talk to friends.
\nThankfully, my favorite ice cream store is\tjust next door \r\n   """
character_map = {
    ord('\n'): ' ',   # newline becomes a space
    ord('\t'): ' ',   # tab becomes a space
    ord('\r'): None   # carriage return is deleted
}
test_string_with_garbage.translate(character_map)


In [None]:
test_string_with_garbage.split()
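
Since split() with no arguments splits on any run of whitespace, joining the pieces back with single spaces is a common way to normalize whitespace in one step. A minimal sketch, using a short made-up string:

```python
messy = "great ice cream\n\nservice is\tjust next door \r\n   "

# split() with no arguments splits on any run of whitespace and drops empty strings
words = messy.split()

# join the words back together with exactly one space between them
clean = " ".join(words)
print(clean)  # great ice cream service is just next door
```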

And that's it!