# This notebook covers  
1. Operations with strings
2. Pattern matching with regular expressions
4. Web scraping (requests, urllib, Beautiful Soup)
4. Data Serialization (simplejson and pickle)
5. Input/Output (file-systems and database systems)

## 1. [Operations with Strings](https://docs.python.org/3/library/string.html)   

A string is a sequence of characters. 
Strings are immutable: every string method returns a new value and never changes the original string.  
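
The behaviour described above can be checked directly. The following is a minimal sketch; the variable names are illustrative and are not used by the cells that follow.

```python
greeting = "hello world"

# upper() builds and returns a new string; `greeting` itself is untouched.
shouted = greeting.upper()
print(shouted)
print(greeting)

# In-place modification is not allowed: item assignment raises TypeError.
try:
    greeting[0] = "H"
except TypeError as error:
    print("TypeError:", error)
```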


In [None]:
text="""
A string is a sequence of characters. 
All the string Methods always return NEW values and do not change or manipulate the original string.
"""
word='characters'

print(word.center(50,'.'))
print("Occurrences of word 'string' in the text : {}".format(text.count('string')))
print("First instance (index) of word 'string' in the text: {}".format(text.find('string')))
print('text swapped cases: {}'.format(text.swapcase()))
print('Capitalize every first letter of words in text: {}'.format(text.title()))
print('***... Sentence ..$%\n'.strip('*.%$'))
print("another sentence\n".zfill(50))
print('***... Sentence ..$%'+' another sentence\n')
print(text[text.index('string',15):text.find('the',text.index('string',15))])

#### other string methods include: 
- capitalize() - capitalizes only the first character of a string.  
- upper() - converts all the letters of the string to uppercase.  
- split() - returns a list of the words in a string; the default separator is any whitespace.  
- startswith() - returns True if the string starts with the specified value; otherwise, it returns False.  
- endswith() - returns True if the string ends with the specified value; otherwise, it returns False.  
- ljust() - returns a left-justified version of the given string using a specified character, whitespace being default.   
- rjust() - aligns the string to the right.  
- strip() - returns a copy of the string with the leading and trailing characters removed. Default character to be removed is whitespace.  
- zfill() - adds zeros(0) at the beginning of the string. The length of the returned string depends on the width provided.  
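
The methods listed above can be tried on a small sample string. This is only an illustrative sketch; the `sample` string and the fill characters are arbitrary choices.

```python
sample = "  python strings  "

stripped = sample.strip()                 # leading/trailing whitespace removed
print(stripped.capitalize())              # Python strings
print(stripped.upper())                   # PYTHON STRINGS
print(sample.split())                     # ['python', 'strings']
print(stripped.startswith("py"))          # True
print(stripped.endswith("ings"))          # True
print("left".ljust(10, "-"))              # left------
print("right".rjust(10, "-"))             # -----right
print("42".zfill(5))                      # 00042
```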

## 2. Pattern matching with [regular expressions](https://docs.python.org/3/howto/regex.html)  

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.  
Several characters have a special meaning when they are used in a regular expression, and many of them involve the backslash, which Python string literals also treat specially. To avoid confusion, we write patterns as raw strings: **r'expression'.**
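
Why raw strings matter can be seen with the word-boundary token `\b`. The sketch below uses an arbitrary example sentence; only the standard `re` module is assumed.

```python
import re

# In a normal string literal "\b" is the backspace character (one character);
# in a raw string r"\b" is a backslash followed by "b" (two characters).
print(len("\b"), len(r"\b"))  # 1 2

sentence = "a cat scattered the catalog"

# With a raw string the regex engine receives \b and treats it as a word boundary.
raw_matches = re.findall(r"\bcat\b", sentence)
# Without the r prefix the pattern contains backspace characters and matches nothing here.
plain_matches = re.findall("\bcat\b", sentence)

print(raw_matches)    # ['cat']
print(plain_matches)  # []
```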

In [None]:
import re

t = re.match(r'text','text123')
print(t.group(),t.span())

s= re.search(r'text','.... 08text98.....')
print(s.group(),s.span())

n = re.search(r'\W+','8773 .... text 11 test 254')
print(n.group(),n.span())

p = re.compile(r'\d+')
p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')

In [None]:
line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")


In [None]:
searchObj = re.search( r'(.*) are (.*?) .*', line, re.M|re.I)

if searchObj:
   print ("searchObj.group() : ", searchObj.group())
   print ("searchObj.group(1) : ", searchObj.group(1))
   print ("searchObj.group(2) : ", searchObj.group(2))
else:
   print ("Nothing found!!")

In [None]:
matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
   print ("match --> matchObj.group() : ", matchObj.group())
else:
   print ("No match!!")

searchObj = re.search( r'dogs', line, re.M|re.I)
if searchObj:
   print ("search --> searchObj.group() : ", searchObj.group())
else:
   print ("Nothing found!!")

In [None]:
#Search and Replace
phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print ("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)    
print ("Phone Num : ", num)

## 3. Web Scraping (Data Mining)

### requests module  


In [None]:
import requests

events = requests.get('https://api.github.com/events')
events.json()[0]['type']

In [None]:
users = requests.get('https://reqres.in/api/users?page=2')
users.json()

In [None]:
user={
    "name": "test254",
    "job": "developer"
}
req = requests.post('https://reqres.in/api/users',data=user)
print(req.status_code,'\n',req.text)

### [urllib](https://docs.python.org/3/howto/urllib2.html) module

In [None]:
from urllib import request

req= request.urlopen('https://www.entrepreneur.com/article/240492')
req.read()

In [None]:
story = request.urlopen('https://www.lifehack.org/articles/communication/10-famous-failures-that-will-inspire-you-success.html')
story.read()

In [None]:
# request with headers 
headers ={
    "method":"GET",
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "cookie":"crfgL0cSt0r=true; __cfduid=d6c34bb6be1b99a723a8ad3cd2bb8b98c1569933817; cf_2355_id=b00e0659-ea5c-4f36-a24f-fbc7f24c50d7; cf_2355_cta_33455=44861; cf_2355_cta_33454=44863; cf_2355_cta_33338=44864; cf_2355_cta_33456=44867; cf_2355_cta_33463=44858; cf_2355_cta_33048=44802; cf_2355_cta_32649=44868; cf_2355_cta_27270=44806; cf_2355_cta_27306=44807; cf_2355_cta_30789=44809; cf_2355_cta_31153=44342; cf_2355_cta_28512=44078; cf_2355_cta_29689=44799; cf_2355_cta_18329=22701; cf_2355_cta_18686=undefined; cf_2355_cta_18637=34458; cf_2355_cta_19039=37195; cf_2355_cta_27671=44559; cf_2355_cta_28869=undefined; cf_2355_cta_29181=39318; cf_2355_cta_30191=41129; cf_2355_cta_30858=39180; cf_2355_cta_31888=40543; cf_2355_cta_33654=42927; cf_2355_cta_33856=43205; cf_2355_cta_33921=43290; cf_2355_cta_34002=43416; cf_2355_cta_34007=44795; cf_2355_cta_35057=45000; gdpr=consented; euconsent=BOnwOQYOnwOmMAAAAAAACa-AAAAo1rv___7__9_-____9uz7Ov_v_f__33e87_9v_h_7_-___u_-3zd4u_1vf99yfm1-7ctr3tp_87uesm_Xur__59__3z3_9phPr8k89r6337EwwEA; consentId=afbd44ce-7565-49ae-a3a5-8dd571b53b5b; pgsrc=pghb.lifehack.article.js; ac_enable_tracking=1; _ga=GA1.2.1760954512.1569933967; _gid=GA1.2.2017657937.1569933967; _fbp=fb.1.1569933967396.1686572414; _hjid=45bf0821-e9e3-4d65-a416-a16ec3d68baf; m2session=a39c11ad-7e05-482a-bb69-bdd8bac1b649; session_depth=1; m2hb=enabled; df_active=true; __gads=ID=fb106cd0e6bc8b9d:T=1569933976:S=ALNI_MaGbDrifNKtkrOQ4_Z1HqYwGPSvJA; cf_2355_person_time=1569933978471",
    "User-Agent":"Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
}
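url = 'https://www.lifehack.org/articles/communication/10-famous-failures-that-will-inspire-you-success.html'
# wrap the URL in a Request object so that the headers defined above are actually sent
story = request.urlopen(request.Request(url, headers=headers))
story.read()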

In [None]:
import requests
req=requests.get('https://www.lifehack.org/articles/communication/10-famous-failures-that-will-inspire-you-success.html')
req.text

### [beautiful soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup)  

Beautiful Soup is a Python library for pulling data out of HTML and XML files. 
Web scraping refers to the automatic extraction of data from web pages and its presentation in a format that a human can easily make sense of. It can be used in a wide variety of situations.  

#### Scraping Rules
1. You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
2. Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
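
Rule 2 above can be sketched as a small helper that pauses between consecutive requests. This is a hedged illustration, not a complete scraper: `fetch_politely`, `fake_fetch`, and the example.com URLs are hypothetical names introduced here, and the fetch function is injected so the sketch runs without network access.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages

# Offline demonstration with a stand-in fetch function (hypothetical URLs).
def fake_fetch(url):
    return "<html>content of {}</html>".format(url)

pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=fake_fetch,
    delay_seconds=0.01,
)
print(len(pages))
```

In a real notebook the `fetch` argument could be something like `lambda url: requests.get(url).text`, keeping the default delay of one second.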

In [None]:
from bs4 import BeautifulSoup
import urllib.request  # importing `urllib` alone does not guarantee that `urllib.request` is loaded
# get page content, use requests module in place of urllib for better experience
page =  urllib.request.urlopen('https://americanliterature.com/100-great-short-stories')
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page,'html.parser')
soup

In [None]:
import requests

page = requests.get('https://www.thestorycorner.org/blogs/news')
soup = BeautifulSoup(page.text,'html.parser')
article_info = soup.find('div', attrs={'class': 'article-info-inner'})
print(article_info)

In [None]:
print(article_info.text)

In [None]:
body=soup.find('body')
print(body.text)

In [None]:
links=[]
for link in soup.find_all('a'):
    links.append(link.get('href'))

In [None]:
links

In [None]:
print(soup.get_text())

## 4. Data Serialization   

### [simplejson](https://simplejson.readthedocs.io/en/latest/)   

simplejson is a simple, fast, complete, correct and extensible [JSON](http://json.org/) encoder and decoder. 
> The encoder can be specialized to provide serialization in any kind of situation, without any special support by the objects to be serialized (somewhat like pickle). This is best done with the default kwarg to dumps.  
The decoder can handle incoming JSON strings of any specified encoding (UTF-8 by default).  
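
The `default` argument mentioned in the quote can be sketched as follows. To keep the example dependency-free it uses the standard library `json` module, whose `dumps` accepts the same `default` argument; treating simplejson as behaving identically here is an assumption. The function name `encode_extra_types` and the sample record are illustrative.

```python
import json  # standard library stand-in for simplejson (assumption: same `default` behaviour)
from datetime import date

def encode_extra_types(obj):
    """Fallback called by dumps() for objects it cannot serialize natively."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError("Object of type {} is not JSON serializable".format(type(obj).__name__))

record = {"tags": {"b", "a"}, "created": date(2019, 10, 1)}
encoded = json.dumps(record, default=encode_extra_types, sort_keys=True)
print(encoded)  # {"created": "2019-10-01", "tags": ["a", "b"]}
```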



In [None]:
import simplejson as json

data={'3': 5, '1': 7,'fruits':['passion', 'strawberry', 'mango','plum']}

print(json.dumps(data, sort_keys=True, indent=4 * ' '))

with open('json_data.json','w') as json_in:
    json.dump(data,json_in)
    
with open('json_data.json','r') as json_out:
    json_data=json.load(json_out)
    
print('JSON data: {}'.format(json_data))

### [pickle](https://docs.python.org/3/library/pickle.html)   

Pickling (serialization with pickle) is the act of translating Python data objects into a byte stream that can be transferred from RAM to disk. Pickled Python objects can easily be unpickled back to their original form (deserialization).  
Although the JSON format is human-readable and language-independent, it can represent only a limited subset of Python's built-in types. With pickle, we can serialize a much wider range of Python types and, importantly, custom classes. This means we don't need to design a custom schema (as we would for JSON) or write error-prone serializers and parsers; pickle does the heavy lifting for us.  
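
The difference in supported types can be illustrated with a small dictionary that JSON cannot encode but pickle can. This is a minimal sketch using only the standard library; the `values` dictionary is an arbitrary example.

```python
import json
import pickle

values = {"primes": {2, 3, 5}, "z": complex(1, 2), "pair": (1, 2)}

# JSON has no representation for sets or complex numbers, so dumps() raises TypeError.
try:
    json.dumps(values)
    json_ok = True
except TypeError as error:
    json_ok = False
    print("json failed:", error)

# pickle round-trips the same object and preserves the Python types.
restored = pickle.loads(pickle.dumps(values))
print(restored == values)
print(type(restored["primes"]).__name__, type(restored["pair"]).__name__)
```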

#### What can be Pickled and Unpickled
- All native datatypes supported by Python (booleans, None, integers, floats, complex numbers, strings, bytes, byte arrays)
- Dictionaries, sets, lists, and tuples - as long as they contain pickleable objects
- Functions and classes that are defined at the top level of a module  

> It is important to note that pickling is not a language-independent serialization method, so pickled data can only be unpickled using Python. It is also safest to unpickle objects with the same version of Python that was used to pickle them, since mixing Python versions can cause problems.  
Additionally, functions are pickled by name reference, not by value. The resulting pickle does not contain the function's code or attributes. Therefore, the environment where the function is unpickled must be able to import the function. In other words, if we pickle a function and then unpickle it in an environment where it is neither defined nor imported, an exception will be raised.  
#### It is also very important to note that pickled objects can be used in malevolent ways. For instance, unpickling data from an untrusted source can result in the execution of a malicious piece of code.
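
One mitigation described in the Python `pickle` documentation is to subclass `pickle.Unpickler` and override `find_class` so that only an allow-list of globals can be loaded. The sketch below follows that idea; the allow-list contents, the `restricted_loads` helper, and the `Suspicious` class are illustrative assumptions, and this does not replace the basic rule of only unpickling data you trust.

```python
import builtins
import io
import pickle

# Illustrative allow-list: only these builtins may be looked up during unpickling.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(
            "global '{}.{}' is forbidden".format(module, name)
        )

def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

class Suspicious:
    # __reduce__ tells pickle which callable to invoke on load; here a harmless eval.
    def __reduce__(self):
        return (eval, ("1 + 1",))

print(restricted_loads(pickle.dumps([1, 2, range(3)])))
try:
    restricted_loads(pickle.dumps(Suspicious()))
except pickle.UnpicklingError as error:
    print("blocked:", error)
```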

In [None]:
#pickling (serializing)

import pickle

fruits = ['passion', 'strawberry', 'mango','plum']

with open('fruits_pickle.pkl', 'wb') as pickle_out:
    pickle.dump(fruits, pickle_out)

In [None]:
#unpickling (deserialization)

with open('fruits_pickle.pkl', 'rb') as pickle_in:
    unpickled_data = pickle.load(pickle_in)

print(unpickled_data)

In [None]:
#Pickling and Unpickling Custom Objects

class NeuralNetwork():
    def __init__(self):
        self.activated = False
    def activate(self):
        self.activated = True
    def set_training_data(self,data):
        self.training_data=data
    

model = NeuralNetwork()
model.activate()
# reuse the list unpickled in the previous cell as training data
model.set_training_data(unpickled_data)

with open('object_pickle', 'wb') as pickle_out:
    pickle.dump(model, pickle_out)

with open('object_pickle', 'rb') as pickle_in:
    unpickled_object = pickle.load(pickle_in)

print('Activated: {} \nTraining Data: {}'.format(unpickled_object.activated,unpickled_object.training_data))

**Remember that we can only unpickle the object in an environment where the class NeuralNetwork is either defined or imported. If we create a new script and try to unpickle the object without importing the NeuralNetwork class, we'll get an "AttributeError".**

## 5. Input and Output (IO)  

### [csv file](https://docs.python.org/3/library/csv.html)   

In [None]:
from simplejson import loads,dumps
import csv

with open('Highway1.csv','r',newline='') as csvfile:
    data_list=csv.reader(csvfile,delimiter=',')
    for row in data_list:
        print(row)

In [None]:
with open('Highway1.csv','r') as csvfile:
    data_dict=csv.DictReader(csvfile,delimiter=',')
    for row in data_dict:
        print(row)

In [None]:
with open('json_data.json','r') as jsonfile:
    data=loads(jsonfile.read())
    
print(data)

In [None]:
from requests import get

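# named `response` so that it does not shadow the urllib `request` module imported earlier
response = get('https://api.github.com/events')
data = response.json()
fieldnames=list(data[0].keys())

# newline='' prevents blank lines between rows on Windows
with open('github_events_data.csv','w',newline='') as csvfile:
    # extrasaction='ignore' skips keys that are missing from fieldnames instead of raising ValueError
    writer=csv.DictWriter(csvfile,fieldnames=fieldnames,extrasaction='ignore')
    writer.writeheader()
    writer.writerows(data)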

with open('github_events_data.json','w') as jsonfile:
    jsonfile.write(dumps(data))
    
with open('github_events_data.json') as jsonfile:
    print(loads(jsonfile.read()))

### [Excel files](http://www.python-excel.org/)    


**Writing to Excel using [xlsxwriter](https://xlsxwriter.readthedocs.io/)**  

In [None]:
import xlsxwriter

# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('Expenses.xlsx')
worksheet = workbook.add_worksheet()

# Some data we want to write to the worksheet.
expenses = (
    ['Rent', 1000],
    ['Gas',   100],
    ['Food',  300],
    ['Gym',    50],
)

# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0

# Iterate over the data and write it out row by row.
for item, cost in (expenses):
    worksheet.write(row, col,     item)
    worksheet.write(row, col + 1, cost)
    row += 1

# Write a total using a formula.
worksheet.write(row, 0, 'Total')
worksheet.write(row, 1, '=SUM(B1:B4)')

workbook.close()

In [None]:
from datetime import datetime
import xlsxwriter

# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('Expenses-mixed.xlsx')
worksheet = workbook.add_worksheet()

# Add a bold format to use to highlight cells.
bold = workbook.add_format({'bold': 1})

# Add a number format for cells with money.
money_format = workbook.add_format({'num_format': '$#,##0'})

# Add an Excel date format.
date_format = workbook.add_format({'num_format': 'mmmm d yyyy'})

# Adjust the column width.
worksheet.set_column(1, 1, 15)

# Write some data headers.
worksheet.write('A1', 'Item', bold)
worksheet.write('B1', 'Date', bold)
worksheet.write('C1', 'Cost', bold)

# Some data we want to write to the worksheet.
expenses = (
 ['Rent', '2013-01-13', 1000],
 ['Gas',  '2013-01-14',  100],
 ['Food', '2013-01-16',  300],
 ['Gym',  '2013-01-20',   50],
)

# Start from the first cell below the headers.
row = 1
col = 0

for item, date_str, cost in (expenses):
    # Convert the date string into a datetime object.
    date = datetime.strptime(date_str, "%Y-%m-%d")

    worksheet.write_string  (row, col,     item              )
    worksheet.write_datetime(row, col + 1, date, date_format )
    worksheet.write_number  (row, col + 2, cost, money_format)
    row += 1

# Write a total using a formula.
worksheet.write(row, 0, 'Total', bold)
worksheet.write(row, 2, '=SUM(C2:C5)', money_format)

workbook.close()

**Writing to and reading from Excel using [openpyxl](https://openpyxl.readthedocs.io/en/stable/index.html)**  

In [None]:
from openpyxl import Workbook
from openpyxl.utils import get_column_letter

wb = Workbook()
dest_filename = 'openpyxl_workbook.xlsx'
ws1 = wb.active
ws1.title = "range names"
for row in range(1, 40):
    ws1.append(range(600))
ws2 = wb.create_sheet(title="Pi")
ws2['F5'] = 3.14
ws3 = wb.create_sheet(title="Data")
for row in range(10, 20):
    for col in range(27, 54):
        _ = ws3.cell(column=col, row=row, value="{0}".format(get_column_letter(col)))
wb.save(filename = dest_filename)

In [None]:
from openpyxl import load_workbook
wb = load_workbook(filename = 'openpyxl_workbook.xlsx')
sheet_ranges = wb['range names']
print(sheet_ranges['D18'].value)

### IO operations using database systems  

> Install the [PostgreSQL](https://www.postgresql.org/download/) database server and create a test user. We will follow [this installation guide](https://www.postgresql.org/download/linux/ubuntu/) to install it on Ubuntu 18.04.

- Create the file **/etc/apt/sources.list.d/pgdg.list**  
`$ sudo nano /etc/apt/sources.list.d/pgdg.list`  
- add a line for the repository  
`deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main`  
- Exit editor with Ctrl+X, save edits.
- Import the repository signing key, and update the package lists   
`$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -`    
`$ sudo apt-get update`   
- Install postgresql database server   
`$ sudo apt-get install postgresql`   
- Log in to the database server as the postgres user, then create the user nlpdemo and the database nlpcrash owned by nlpdemo  
`$ sudo su postgres`  
`$ psql`  
`> create role nlpdemo with login password 'UserPassword';`  
`> create database nlpcrash owner nlpdemo;`  
`> \q`  
`$ exit`  



We will use [psycopg2](http://initd.org/psycopg/docs/index.html) to connect to our database server. Psycopg2 is the most popular PostgreSQL database adapter for the Python programming language.  


In [6]:
import psycopg2

db_conf={
    "user":"nlpdemo",
    "password":"UserPassword",
    "host":"localhost",
    "port":5432,
    "database":"nlpcrash"
}

# Connect to an existing database
conn = psycopg2.connect("dbname={} user={} host={} port={} password={}".format(
    db_conf['database'],
    db_conf['user'],
    db_conf['host'],
    db_conf['port'],
    db_conf['password']
))

# Open a cursor to perform database operations
cur = conn.cursor()

# Execute a command: this creates the table if it does not exist yet,
# so the cell can be re-run without raising an error
cur.execute("CREATE TABLE IF NOT EXISTS test (id serial PRIMARY KEY, num integer, data varchar);")

# Pass data to fill the query placeholders and let Psycopg perform
# the correct conversion (no more SQL injections!)
cur.execute("INSERT INTO test (num, data) VALUES (%s, %s)",(100, "abc'def"))

# Query the database and obtain data as Python objects
cur.execute("SELECT * FROM test;")
print(cur.fetchall())

# Make the changes to the database persistent
conn.commit()

# Close communication with the database
cur.close()
conn.close()

[(1, 100, "abc'def"), (2, 100, "abc'def"), (3, 100, "abc'def")]
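
The placeholder technique used in the INSERT statement above can be tried without a running PostgreSQL server. The sketch below swaps in the standard library `sqlite3` module as a stand-in, which is an assumption made purely for portability; note that sqlite3 uses `?` placeholders where psycopg2 uses `%s`. The `tricky` value is an illustrative input.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, num INTEGER, data TEXT)")

# The value contains quotes and SQL-looking text; as a bound parameter it is stored verbatim.
tricky = "abc'def'); DROP TABLE test; --"
cur.execute("INSERT INTO test (num, data) VALUES (?, ?)", (100, tricky))
conn.commit()

cur.execute("SELECT num, data FROM test WHERE data = ?", (tricky,))
rows = cur.fetchall()
print(rows)

cur.close()
conn.close()
```

Because the values are passed separately from the SQL text, the driver never interprets them as part of the statement, which is the property the comment in the psycopg2 cell refers to.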


**For more advanced usage of the psycopg2 package, follow [this guide](http://initd.org/psycopg/docs/usage.html).**