<h1>Collecting Text Data from the Web</h1>
<br>
<h2>Collecting Data by Scraping Web Pages</h2>

<b>Insert a new cell and add the following code to import the <code>BeautifulSoup</code> library</b>

In [1]:
from bs4 import BeautifulSoup

<b>Then, we create an object of the <code>BeautifulSoup</code> class, passing it the opened HTML file and the name of the parser to use</b>

In [2]:
soup = BeautifulSoup(open('data/sample_doc.html'), 'html.parser')

<b>Add the following code to check the <code>text</code> contents of the <code>sample_doc.html</code> file</b>

In [3]:
soup.text

'\n\n\n A sample HTML Page \n\n\nI am staying at  Mess on No. 72, Banamali Naskar Lane, Kolkata. \nSherlock  stays at 221B, Baker Street, London, UK. \nHamlet said to Horatio,   There are more things in heaven and earth, Horatio,  Than are dreamt of in your philosophy. \n A table denoting details of students\n\n\nname\nqualification\nadditional qualification\nother qualification\n\n\nGangaram\nB.Tech\nNA\nNA\n\n\nGanga\nB.A.\nNA\nNA\n\n\nRam\nB.Tech\nM.Tech\nNA\n\n\nRamlal\nB.Music\nNA\nDiploma in Music\n\n\n\n'

<b>To check the <code>address</code> tag, we insert a new cell and add the following code</b>

In [4]:
soup.find('address')

<address> Mess on No. 72, Banamali Naskar Lane, Kolkata.</address>

<b>To locate all the <code>address</code> tags within the given content, write the following code</b>

In [5]:
soup.find_all('address')

[<address> Mess on No. 72, Banamali Naskar Lane, Kolkata.</address>,
 <address>221B, Baker Street, London, UK.</address>]

<b>To check <code>quotes</code> within the content, we write the following code</b>

In [6]:
soup.find_all('q')

[<q> There are more things in heaven and earth, Horatio, <br/> Than are dreamt of in your philosophy. </q>]

<b>To check all the <code>bold</code> items, we write the following command</b>

In [7]:
soup.find_all('b')

[<b>Sherlock </b>, <b>Hamlet</b>, <b>Horatio</b>]

<b>To select the <code>table</code> tag and keep it in a variable for the next steps, we write the following command</b>

In [8]:
table = soup.find('table')

<b>We can also view the content of <code>table</code> by looping through it. Insert a new cell and add the following code to implement this</b>

In [9]:
for row in table.find_all('tr'):
    columns = row.find_all('td')
    print(columns)

[]
[<td>Gangaram</td>, <td>B.Tech</td>, <td>NA</td>, <td>NA</td>]
[<td>Ganga</td>, <td>B.A.</td>, <td>NA</td>, <td>NA</td>]
[<td>Ram</td>, <td>B.Tech</td>, <td>M.Tech</td>, <td>NA</td>]
[<td>Ramlal</td>, <td>B.Music</td>, <td>NA</td>, <td>Diploma in Music</td>]


<b>We can also locate specific content in the table. To get the value in the third data row and the third column, we use index 3 for the row (index 0 is the header row) and index 2 for the cell</b>

In [10]:
table.find_all('tr')[3].find_all('td')[2]

<td>M.Tech</td>
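
The empty list printed for the first row most likely comes from a header row whose cells are <code>th</code> rather than <code>td</code> tags (an assumption, since the HTML source is not shown here). Positional lookups such as <code>[3]</code> and <code>[2]</code> are easy to get wrong, so one option is to pair each row with the header names. The sketch below hard-codes the cell texts seen in the output above; in the notebook they could presumably be collected with <code>row.find_all(['th', 'td'])</code> instead.

```python
# Cell texts copied from the table output above. The variable names are
# illustrative and not part of the original notebook.
header = ['name', 'qualification', 'additional qualification', 'other qualification']
rows = [
    ['Gangaram', 'B.Tech', 'NA', 'NA'],
    ['Ganga', 'B.A.', 'NA', 'NA'],
    ['Ram', 'B.Tech', 'M.Tech', 'NA'],
    ['Ramlal', 'B.Music', 'NA', 'Diploma in Music'],
]

# Pair every row with the header so values can be looked up by column name.
records = [dict(zip(header, row)) for row in rows]

# Third data row (index 2 once the header is separated), looked up by name.
ram_additional = records[2]['additional qualification']
print(ram_additional)
```

Looking values up by column name keeps the code readable if a column is later added or reordered.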

<h2>Requesting Content from Web Pages</h2>

<b>Use the <code>requests</code> library to request the content of a book available online with the following set of commands</b>

In [11]:
import requests

In [12]:
r = requests.get('https://www.gutenberg.org/files/766/766-0.txt')
r.status_code

200

Here, the status code 200 indicates that the request succeeded.
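
Status codes other than 200 are worth handling explicitly. The following is a minimal sketch that uses only the standard library's <code>http.HTTPStatus</code> to turn a numeric code into a readable label; the helper names <code>describe_status</code> and <code>is_success</code> are made up for this illustration. With <code>requests</code>, calling <code>r.raise_for_status()</code> is another way to stop early on an error response.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '200 OK' for a numeric HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know about.
        return f"{code} (unrecognized status)"


def is_success(code: int) -> bool:
    """True for any status code in the 2xx range."""
    return 200 <= code < 300


print(describe_status(200))
print(describe_status(404))
print(is_success(200), is_success(404))
```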

<b>To view the first 1,000 characters of the fetched text, write the following code</b>

In [13]:
r.text[:1000]

'ï»¿The Project Gutenberg EBook of David Copperfield, by Charles Dickens\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: David Copperfield\r\n\r\nAuthor: Charles Dickens\r\n\r\nRelease Date: December, 1996  [Etext #766]\r\nPosting Date: November 24, 2009\r\nLast Updated: September 25, 2016\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK DAVID COPPERFIELD ***\r\n\r\n\r\n\r\n\r\nProduced by Jo Churcher\r\n\r\n\r\n\r\n\r\n\r\nDAVID COPPERFIELD\r\n\r\n\r\nBy Charles Dickens\r\n\r\n\r\n\r\n               AFFECTIONATELY INSCRIBED TO\r\n               THE HON.  Mr. AND Mrs. RICHARD WATSON,\r\n               OF ROCKINGHAM, NORTHAMPTONSHIRE.\r\n\r\n\r\nCONTENTS\r\n\r\n\r\n     I.      I Am Born\r\n    
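
The three odd characters at the start of this output are what a UTF-8 byte order mark (the bytes <code>EF BB BF</code>) looks like when it is decoded with a single-byte encoding such as Latin-1. A likely explanation, though not verified against this server, is that the response did not declare a charset and the text was decoded with a Latin-1 fallback. The sketch below reproduces the effect on a short byte string and shows that the <code>utf-8-sig</code> codec strips the mark. In the notebook, setting <code>r.encoding = 'utf-8-sig'</code> before reading <code>r.text</code> should have the same effect; treat that as an assumption to check.

```python
# A short byte string that starts with the UTF-8 byte order mark (BOM).
raw = b'\xef\xbb\xbfThe Project Gutenberg EBook'

as_latin1 = raw.decode('latin-1')      # BOM bytes become three visible characters
as_utf8 = raw.decode('utf-8')          # BOM becomes the invisible U+FEFF character
as_utf8_sig = raw.decode('utf-8-sig')  # BOM is removed entirely

print(repr(as_latin1[:6]))
print(repr(as_utf8[:4]))
print(as_utf8_sig)
```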

<b>Now we'll write the fetched content to a text file. To do that, add the following code</b>

In [14]:
open("data/David_Copperfield.txt", 'w', encoding="utf-8").write(r.text)

2033139

<b>Now we'll make use of the <code>urllib3</code> library to request the content of the book, available online. Add the following code to do so</b>

In [15]:
import urllib3
http = urllib3.PoolManager()
rr = http.request('GET', 'https://www.gutenberg.org/files/766/766-0.txt')
rr.data[:1000]



b'\xef\xbb\xbfThe Project Gutenberg EBook of David Copperfield, by Charles Dickens\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: David Copperfield\r\n\r\nAuthor: Charles Dickens\r\n\r\nRelease Date: December, 1996  [Etext #766]\r\nPosting Date: November 24, 2009\r\nLast Updated: September 25, 2016\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK DAVID COPPERFIELD ***\r\n\r\n\r\n\r\n\r\nProduced by Jo Churcher\r\n\r\n\r\n\r\n\r\n\r\nDAVID COPPERFIELD\r\n\r\n\r\nBy Charles Dickens\r\n\r\n\r\n\r\n               AFFECTIONATELY INSCRIBED TO\r\n               THE HON.  Mr. AND Mrs. RICHARD WATSON,\r\n               OF ROCKINGHAM, NORTHAMPTONSHIRE.\r\n\r\n\r\nCONTENTS\r\n\r\n\r\n     I.      I Am Bo

<b>Once the content is fetched properly, we write it to a text file using the following code</b>

In [16]:
open('data/David_Copperfield_new.txt', 'wb').write(rr.data)

2033139

<h1>Analyzing the Content of Jupyter Notebooks (in HTML Format)</h1>
<br>
In this exercise, we will analyze the content of <code>text_classifier.html</code>. Here, we will focus on things such as counting the number of images, listing the packages that have been imported, and checking models and their performance.

In [17]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('data/text_classifier.html'), 'html.parser')
soup.text[:100]

'\n\n\nCh3_Activity7_Developing_end_to_end_Text_Classifiers\n\n\n\n    /*!\n*\n* Twitter Bootstrap\n*\n*/\n/*!\n *'

<b>To count the number of images, we make use of the <code>img</code> tag</b>

In [18]:
len(soup.find_all('img'))

3

<b>To list all the packages that are imported, we add the following code</b>

In [19]:
[i.get_text() for i in soup.find_all('span',attrs={"class":"nn"})]

['pandas',
 'pd',
 'seaborn',
 'sns',
 'matplotlib.pyplot',
 'plt',
 're',
 'string',
 'nltk',
 'nltk.corpus',
 'nltk.stem',
 'sklearn.feature_extraction.text',
 'sklearn.model_selection',
 'pylab',
 'nltk',
 'sklearn.metrics',
 'sklearn.linear_model',
 'sklearn.ensemble',
 'xgboost']

<b>To extract the models and their performances, look for the <code>h2</code> and <code>div</code> tags with the <code>class</code> attribute</b>

In [20]:
for md, i in zip(soup.find_all('h2'), soup.find_all('div', attrs={"class":"output_subarea output_stream output_stdout output_text"})):
    print("Model: ", md.get_text())
    print(i.get_text())
    print("---------------------------------------------------------\n\n\n")

Model:  Logistic Regression¶

confusion matrix: 
 [[28705   151]
 [ 1663  1396]]

accuracy:  0.943161522794924

classification report: 
               precision    recall  f1-score   support

           0       0.95      0.99      0.97     28856
           1       0.90      0.46      0.61      3059

   micro avg       0.94      0.94      0.94     31915
   macro avg       0.92      0.73      0.79     31915
weighted avg       0.94      0.94      0.93     31915


Area under ROC curve for validation set: 0.911224422146723


---------------------------------------------------------



Model:  Random Forest¶

confusion matrix: 
 [[28856     0]
 [ 2990    69]]

accuracy:  0.9063136456211812

classification report: 
               precision    recall  f1-score   support

           0       0.91      1.00      0.95     28856
           1       1.00      0.02      0.04      3059

   micro avg       0.91      0.91      0.91     31915
   macro avg       0.95      0.51      0.50     31915
weighted 

<h2>Extracting Information from an Online HTML Page</h2>

In this activity, we will extract data about Rabindranath Tagore from a Wikipedia page. After extracting the data, we will analyze things such as the list of headings under the Works section, the list of his works, and the list of universities named after him.
<br>
<br>
Read the Wikipedia page about Rabindranath Tagore. Extract the following information from it:
<br>
<ul>
    <li>List of headings under Section Works.</li>
    <li>List of his Works.</li>
    <li>List of Universities named after him.</li>
</ul>

<b>Import the <code>requests</code> and <code>BeautifulSoup</code> libraries</b>

In [23]:
import requests
from bs4 import BeautifulSoup

<b>Fetch the Wikipedia page from https://bit.ly/1ZmRIPC using the <code>get</code> method of the <code>requests</code> library</b>

In [26]:
r = requests.get('https://en.wikipedia.org/wiki/Rabindranath_Tagore')
r.status_code

200

<b>Parse the fetched content using an <code>HTML</code> parser</b>

In [27]:
soup = BeautifulSoup(r.text, 'html.parser')

<b>Print the list of headings under the Works section</b>

In [30]:
for ele in soup.find_all('h3')[:6]:
    tx = ele.find('span', attrs={'class': 'mw-headline'})  # search inside the existing tag; no need to re-parse it
    if tx is not None:
        print(tx['id'])

Drama
Short_stories
Novels
Poetry
Songs_(Rabindra_Sangeet)
Art_works


In [29]:
soup.find_all('h3')[:6]

[<h3><span class="mw-headline" id="Drama">Drama</span></h3>,
 <h3><span class="mw-headline" id="Short_stories">Short stories</span></h3>,
 <h3><span class="mw-headline" id="Novels">Novels</span></h3>,
 <h3><span class="mw-headline" id="Poetry">Poetry</span></h3>,
 <h3><span id="Songs_.28Rabindra_Sangeet.29"></span><span class="mw-headline" id="Songs_(Rabindra_Sangeet)">Songs (Rabindra Sangeet)</span></h3>,
 <h3><span class="mw-headline" id="Art_works">Art works</span></h3>]

<b>Print the list of works by Tagore</b>

In [32]:
table = soup.find_all('table')[1]
for row in table.find_all('tr'):
    columns = row.find_all('td')
    if len(columns)>0:
        columns = columns[1:]
        print(BeautifulSoup(str(columns), 'html.parser').text.strip())

[Bhānusiṃha Ṭhākurer Paḍāvalī, (Songs of Bhānusiṃha Ṭhākur), 1884
]
[Manasi, (The Ideal One), 1890
]
[Sonar Tari, (The Golden Boat), 1894
]
[Gitanjali, (Song Offerings), 1910
]
[Gitimalya, (Wreath of Songs), 1914
]
[Balaka, (The Flight of Cranes), 1916
]
[Valmiki-Pratibha, (The Genius of Valmiki), 1881
]
[Kal-Mrigaya, (The Fatal Hunt), 1882
]
[Mayar Khela, (The Play of Illusions), 1888
]
[Visarjan, (The Sacrifice), 1890
]
[Chitrangada, (Chitrangada), 1892
]
[Raja, (The King of the Dark Chamber), 1910
]
[Dak Ghar, (The Post Office), 1912
]
[Achalayatan, (The Immovable), 1912
]
[Muktadhara, (The Waterfall), 1922
]
[Raktakarabi, (Red Oleanders), 1926
]
[Chandalika, (The Untouchable Girl), 1933
]
[Nastanirh, (The Broken Nest), 1901
]
[Gora, (Fair-Faced), 1910
]
[Ghare Baire, (The Home and the World), 1916
]
[Yogayog, (Crosscurrents), 1929
]
[Jivansmriti, (My Reminiscences), 1912
]
[Chhelebela, (My Boyhood Days), 1940


]


<b>Print the list of universities named after Tagore</b>

In [33]:
[BeautifulSoup(str(i),'html.parser').text.strip() for i in soup.find('ol') if i!='\n']

['Rabindra Bharati University, Kolkata, India.',
 'Rabindra University, Sahjadpur, Shirajganj, Bangladesh.[1]',
 'Rabindranath Tagore University, Hojai, Assam, India',
 'Rabindra Maitree University, Courtpara, Kustia,Bangladesh.[2]',
 'Bishwakabi Rabindranath Tagore Hall, Jahangirnagar University, Bangladesh',
 'Rabindra Nazrul Art Building, Arts Faculty, Islamic University, Bangladesh',
 'Rabindra Library (Central), Assam University, India',
 'Rabindra Srijonkala University, Keraniganj, Dhaka, Bangladesh']
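
Some of the extracted entries still carry Wikipedia footnote markers such as <code>[1]</code> and trailing periods. A minimal sketch of one way to tidy them with a regular expression is shown below; the three sample strings are copied from the output above, and the names <code>FOOTNOTE_RE</code> and <code>cleaned</code> are illustrative only.

```python
import re

# A few entries copied from the output above.
names = [
    'Rabindra University, Sahjadpur, Shirajganj, Bangladesh.[1]',
    'Rabindranath Tagore University, Hojai, Assam, India',
    'Rabindra Maitree University, Courtpara, Kustia,Bangladesh.[2]',
]

# Matches footnote markers such as [1] or [12].
FOOTNOTE_RE = re.compile(r'\[\d+\]')

# Remove the markers, then strip any trailing period or space.
cleaned = [FOOTNOTE_RE.sub('', name).rstrip('. ') for name in names]
print(cleaned)
```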

<h2>Extracting and Analyzing Data Using Regular Expressions</h2>
<br>
In this activity, we will extract data from Packt's website. The data to be extracted includes FAQs and their answers, phone numbers, and emails. Follow these steps to implement this activity.
<br>
<br>
Extract the following from the Packt website: <br>
<ul>
    <li>FAQs and their answers from 
        <a href='https://www.packtpub.com/books/info/packt/faq'>
            https://www.packtpub.com/books/info/packt/faq</a>.</li>
    <li>Phone numbers and emails from
        <a href='https://www.packtpub.com/books/info/packt/terms-and-conditions'>
            https://www.packtpub.com/books/info/packt/terms-and-conditions 
        </a></li>
</ul>
<b>Fetch the FAQ page using the <code>requests</code> library, which was imported earlier</b>

In [35]:
r = requests.get('https://www.packtpub.com/books/info/packt/faq')
r.status_code

200

<b>Extract data from https://bit.ly/2uw0Avf using the <code>urllib3</code> library</b>

In [59]:
http = urllib3.PoolManager()
rr = http.request('GET', 'https://www.packtpub.com/books/info/packt/faq')
rr.status



200

<b>Fetch questions and answers from the data</b>

In [60]:
soup = BeautifulSoup(rr.data, 'html.parser')

In [123]:
questions = [question.text.split("\n")[0].lstrip() for question in soup.find_all('div', attrs={"class": "tab"})]
questions

['How can I download eBooks?',
 'What format are Packt eBooks?',
 'Can I send an eBook to my Kindle?',
 'How can I download code files for eBooks and Videos?',
 'How can I download Videos?',
 'How can I gift an eBook/Video/Course/Packt subscription?',
 'Can I send an eBook to my Kindle?',
 'What are the different types of courses available on Packt website?',
 'Which courses are accessible with the subscription?',
 'What are assessments? How can I access them?',
 'Where will I get the answers to the assessments?',
 'Does the course contain any text content?',
 'How can I access the text content?',
 'What is an Integrated Course?',
 'If I complete a course, will I get any certification?',
 'How do I download a Video course?',
 'Is "Readium" required to open certain blended courses?']

In [125]:
answers = [answer.text.strip() for answer in soup.find_all('div',attrs={"class":"tab-content"})]
answers

['Once you complete your eBook purchase, the download link for your eBook will be available in your Packt account. You can access your eBook by following the steps below:\n\nLogin to your account\nClick on "My Account"\nClick on "My owned products"\nDownload the eBook in your desired format.\n\nIf you own an eBook and are viewing the product page you can also download it from there\nIf you have purchased an Early Access eBook title you can only download the published chapters from your account or read them online with an active subscription. You can download the complete eBook, only once the eBook is published.',
 'Packt eBooks can be downloaded as a PDF, EPUB or MOBI file. They can also be viewed online using your subscription.',
 'Yes, if you follow the previous instructions on how to download an eBook and select "Send to Kindle" you will be able to enter your Kindle details and send the file. There is however a 30MB limit on sending files to Kindle.',
 "There are a number of simple 

<b>Create a DataFrame consisting of questions and answers</b>

In [126]:
import pandas as pd
pd.DataFrame({'questions':questions, 'answers':answers}).head()

Unnamed: 0,questions,answers
0,How can I download eBooks?,"Once you complete your eBook purchase, the dow..."
1,What format are Packt eBooks?,"Packt eBooks can be downloaded as a PDF, EPUB ..."
2,Can I send an eBook to my Kindle?,"Yes, if you follow the previous instructions o..."
3,How can I download code files for eBooks and V...,There are a number of simple ways to access Co...
4,How can I download Videos?,"Once you complete your Video purchase, the dow..."


<b>Fetch email addresses and phone numbers with the help of regular expressions</b>

In [127]:
rr_tc = http.request('GET', 'https://www.packtpub.com/books/info/packt/terms-and-conditions')
rr_tc.status



200

In [128]:
soup = BeautifulSoup(rr_tc.data, 'html.parser')

In [129]:
import re
set(re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",soup.text))

{'customercare@packt.com', 'subscription.support@packt.com'}

In [130]:
re.findall(r"\+\d{2}\s{1}\(0\)\s\d{3}\s\d{3}\s\d{3}",soup.text)

['+44 (0) 121 265 648', '+44 (0) 121 212 141']
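
Both phone numbers above end in a three-digit group because the pattern asks for exactly three digits there. If a number on the page had a four-digit final group, the match would be cut short without any warning; the output alone cannot tell us whether that happened here. The sketch below illustrates the effect on made-up text (the addresses and the phone number are fictional, and the pattern names are illustrative).

```python
import re

# Made-up text standing in for soup.text.
sample = ("Email support@example.com or sales@example.org. "
          "Phone: +44 (0) 121 555 0199.")

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Same shape as the notebook's pattern: exactly three digits in the last group.
PHONE_3 = re.compile(r"\+\d{2}\s\(0\)\s\d{3}\s\d{3}\s\d{3}")
# Variant that accepts three or four digits in the last group.
PHONE_3_OR_4 = re.compile(r"\+\d{2}\s\(0\)\s\d{3}\s\d{3}\s\d{3,4}")

emails = sorted(set(EMAIL_RE.findall(sample)))
phones_short = PHONE_3.findall(sample)
phones_full = PHONE_3_OR_4.findall(sample)

print(emails)
print(phones_short)
print(phones_full)
```

Comparing the two phone results on a sample of the real page is a quick way to see whether the stricter pattern is truncating anything.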

<h2>Dealing with JSON Files</h2>
<br>
In this exercise, we will extract details such as the names of students, their qualifications, and additional qualifications from a JSON file. Follow these steps to implement this exercise.
<br>
<br>
<b>Insert a new cell and import <code>json</code>. Pass the location of the file mentioned using the following commands</b>

In [133]:
import json
from pprint import pprint
data = json.load(open('data/sample_json.json'))
pprint(data)

{'students': [{'name': 'Gangaram', 'qualification': 'B.Tech'},
              {'name': 'Ganga', 'qualification': 'B.A.'},
              {'additional qualification': 'M.Tech',
               'name': 'Ram',
               'qualification': 'B.Tech'},
              {'name': 'Ramlal',
               'other qualification': 'Diploma in Music',
               'qualification': 'B.Music'}]}


<b>To extract the names of the students, add the following code</b>

In [136]:
[dt['name'] for dt in data['students']]

['Gangaram', 'Ganga', 'Ram', 'Ramlal']

<b>To extract their qualifications, enter the following code</b>

In [137]:
[dt['qualification'] for dt in data['students']]

['B.Tech', 'B.A.', 'B.Tech', 'B.Music']

<b>Not every student has additional qualifications, so we check whether the key exists before reading it. To extract the additional qualifications, enter the following code</b>

In [140]:
[dt['additional qualification'] if 'additional qualification' in dt.keys() else None for dt in data['students']]

[None, None, 'M.Tech', None]

In [141]:
[dt['other qualification'] if 'other qualification' in dt.keys() else None for dt in data['students']]

[None, None, None, 'Diploma in Music']
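
The conditional expressions above can be written more compactly with <code>dict.get</code>, which returns <code>None</code> (or a default of your choice) when a key is missing. The sketch below uses a small inline JSON string modeled on two of the records shown earlier, so it runs without the data file.

```python
import json

# Inline JSON modeled on the structure of the sample file.
raw = '''{"students": [
  {"name": "Gangaram", "qualification": "B.Tech"},
  {"name": "Ram", "qualification": "B.Tech", "additional qualification": "M.Tech"}
]}'''

data = json.loads(raw)

# dict.get returns None when the key is absent...
additional = [s.get('additional qualification') for s in data['students']]
# ...or the supplied default value.
with_default = [s.get('additional qualification', 'NA') for s in data['students']]

print(additional)
print(with_default)
```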

<h2>Dealing with Online Files</h2>
<br>
In this activity, we will fetch JSON files from the web, extract comments, and evaluate the sentiment score of each of them. We will make use of the TextBlob library. Follow these steps to implement this activity.
<br>
<br>
Extract comments from <a href='https://jsonplaceholder.typicode.com/comments'>https://jsonplaceholder.typicode.com/comments</a> and evaluate the sentiment score of each one using TextBlob. Also collect 15 author names and titles by parsing the JSON files available from <a href='http://libgen.io/json.php'>http://libgen.io/json.php</a>.
<br>
<br>
<b>Import the necessary libraries</b>

In [142]:
import json
import urllib3
from textblob import TextBlob
from pprint import pprint
import pandas as pd

<b>Fetch the data from <a href='https://bit.ly/2TJ1T4H'>https://bit.ly/2TJ1T4H</a> using the <code>urllib3</code> library</b>

In [143]:
http = urllib3.PoolManager()
rr = http.request('GET', 'https://jsonplaceholder.typicode.com/comments')
rr.status



200

In [144]:
data = json.loads(rr.data.decode('utf-8'))

<b>Create a DataFrame from the fetched data</b>

In [145]:
import pandas as pd
df = pd.DataFrame(data).head(15)
df.head()

Unnamed: 0,postId,id,name,email,body
0,1,1,id labore ex et quam laborum,Eliseo@gardner.biz,laudantium enim quasi est quidem magnam volupt...
1,1,2,quo vero reiciendis velit similique earum,Jayne_Kuhic@sydney.com,est natus enim nihil est dolore omnis voluptat...
2,1,3,odio adipisci rerum aut animi,Nikita@garfield.biz,quia molestiae reprehenderit quasi aspernatur\...
3,1,4,alias odio sit,Lew@alysha.tv,non et atque\noccaecati deserunt quas accusant...
4,1,5,vero eaque aliquid doloribus et culpa,Hayden@althea.biz,harum non quasi et ratione\ntempore iure ex vo...


<b>Translate the comments in the data into English</b>

In [146]:
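# Note: 'u' + str(x) prepends a literal "u" to every comment, which explains the stray "u"s in the translations below; TextBlob(str(x)) would avoid this.
df['body_english'] = df['body'].apply(lambda x: str(TextBlob('u'+str(x)).translate(to='en')))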

In [147]:
df[['body', 'body_english']].head()

Unnamed: 0,body,body_english
0,laudantium enim quasi est quidem magnam volupt...,"For them, as it were, is, indeed, the very gre..."
1,est natus enim nihil est dolore omnis voluptat...,"uest was born, all the pain, the pleasure is n..."
2,quia molestiae reprehenderit quasi aspernatur\...,Uquia discomfort criticized as dislikes\nof pr...
3,non et atque\noccaecati deserunt quas accusant...,unon and and the\nof denouncing pleasure and f...
4,harum non quasi et ratione\ntempore iure ex vo...,"not as it were, and by reason of uhari\nat the..."


<b>Make use of the TextBlob library to find the sentiment of each comment and display it</b>

In [148]:
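# Note: 'u' + str(x) prepends a literal "u" to the text, and the outer str(...) stores each score as a string;
# TextBlob(str(x)).sentiment.polarity would leave the text unchanged and keep the score numeric.
df['sentiment_score'] = df['body_english'].apply(lambda x: str(TextBlob('u'+str(x)).sentiment.polarity))
df[['body_english', 'sentiment_score']]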

Unnamed: 0,body_english,sentiment_score
0,"For them, as it were, is, indeed, the very gre...",1.0
1,"uest was born, all the pain, the pleasure is n...",0.0
2,Uquia discomfort criticized as dislikes\nof pr...,0.5
3,unon and and the\nof denouncing pleasure and f...,-0.4166666666666667
4,"not as it were, and by reason of uhari\nat the...",0.3202380952380952
5,"Udolorem at fault, but one which must be aband...",0.0
6,"but in labor and in pain, and in the same, and...",0.4
7,he wishes to become corrupt in the pleasure of...,0.0
8,"discomfort, and at once take usapiente\nso tha...",-0.3388888888888888
9,Uvoluptate regular very important for us to fi...,0.3177777777777777


<h2>Dealing with a Local XML File</h2>
<br>
In this exercise, we will parse an XML file and print various things, such as the names of employees, the organizations they work for, and the total salaries of all employees. Follow these steps to implement this exercise.
<br>
<br>
<b>Insert a new cell, <code>import xml.etree.ElementTree</code>, and pass the location of the XML file using the following code</b>

In [149]:
import xml.etree.ElementTree as ET
tree = ET.parse('data/sample_xml_data.xml')
root = tree.getroot()
root

<Element 'records' at 0x0000024B854F3AE8>

<b>To check the tag of the fetched element, type the following code</b>

In [150]:
root.tag

'records'

<b>Look for the <code>name</code>, <code>company</code>, and <code>salary</code> tags in the first 20 records of the XML and print the data enclosed within them</b>

In [157]:
for record in root.findall('record')[:20]:
    print("Name: ", record.find('name').text, " Company: ", record.find('company').text, " Salary: ", record.find('salary').text)

Name:  Peter Brewer  Company:  Erat Ltd  Salary:  $5,042
Name:  Wallace Pace  Company:  Sed Nunc Industries  Salary:  $9,290
Name:  Arthur Ray  Company:  Amet Faucibus Corp.  Salary:  $8,199
Name:  Judah Vaughn  Company:  Nunc Quis Arcu Inc.  Salary:  $9,007
Name:  Talon Combs  Company:  Leo Elementum Ltd  Salary:  $9,875
Name:  Hall Bruce  Company:  Proin Non Massa Consulting  Salary:  $6,527
Name:  Ronan Grant  Company:  Scelerisque Sed Inc.  Salary:  $5,507
Name:  Dennis Whitaker  Company:  Scelerisque Neque Foundation  Salary:  $9,196
Name:  Bradley Oconnor  Company:  Aliquet Corporation  Salary:  $9,069
Name:  Forrest Alvarez  Company:  Et Eros Institute  Salary:  $6,012
Name:  Ignatius Meyers  Company:  Facilisis Lorem Limited  Salary:  $9,588
Name:  Bert Randolph  Company:  Facilisis LLP  Salary:  $9,862
Name:  Victor Stevenson  Company:  Lacinia Vitae Sodales Incorporated  Salary:  $5,885
Name:  Jamal Cummings  Company:  Litora Ltd  Salary:  $6,296
Name:  Samson Estrada  Compan

<b>Create a list consisting of the salaries of all employees. Use numpy to find out the sum of the salaries</b>

In [156]:
import numpy as np
np.sum([int(record.find('salary').text.replace('$', '').replace(',','')) for record in root.findall('record')])

745609
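
The salary values are strings such as <code>$5,042</code>, so they have to be cleaned before they can be added. The sketch below shows the same idea on a small inline XML string; its structure is inferred from the tags used above, the two records are copied from the printed output, and <code>parse_salary</code> is a helper name made up for this illustration. The built-in <code>sum</code> is used here, so <code>numpy</code> is not needed for a plain total.

```python
import xml.etree.ElementTree as ET

# Two records copied from the output above; the layout is an assumption
# based on the tag names used in the notebook.
xml_text = """<records>
  <record><name>Peter Brewer</name><company>Erat Ltd</company><salary>$5,042</salary></record>
  <record><name>Wallace Pace</name><company>Sed Nunc Industries</company><salary>$9,290</salary></record>
</records>"""


def parse_salary(text: str) -> int:
    """Convert a string such as '$5,042' to the integer 5042."""
    return int(text.replace('$', '').replace(',', ''))


root = ET.fromstring(xml_text)
salaries = [parse_salary(rec.findtext('salary')) for rec in root.findall('record')]
total = sum(salaries)
print(salaries, total)
```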

<h2>Collecting Data Using APIs</h2>
<br>
In this exercise, we will extract carbon intensities from December 30, 2018, to January 3, 2019, using an API. Follow these steps to implement this exercise.
<br>
<br>
<b>Import the necessary packages</b>

In [158]:
import urllib3

<b>Construct the corresponding URL and call it</b>

In [160]:
http = urllib3.PoolManager()
start_dt = '2018-12-30T12:35Z'
end_dt = '2019-01-03T12:35Z'

In [161]:
rrq = http.request('GET', 'https://api.carbonintensity.org.uk/intensity/'+start_dt+'/'+end_dt, \
                   headers = {'Accept': 'application/json'})
rrq.status



200
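
Building the URL by string concatenation works, but it is easy to mistype a timestamp. The following sketch, which makes no network call, formats <code>datetime</code> objects into the timestamp layout used above and assembles the same URL. The helper names <code>to_api_timestamp</code> and <code>build_intensity_url</code> are hypothetical and not part of any library.

```python
from datetime import datetime, timezone

BASE_URL = 'https://api.carbonintensity.org.uk/intensity'


def to_api_timestamp(dt: datetime) -> str:
    """Format an aware datetime as e.g. '2018-12-30T12:35Z' (UTC)."""
    return dt.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%MZ')


def build_intensity_url(start: datetime, end: datetime) -> str:
    """Assemble the request URL for the period between start and end."""
    if end <= start:
        raise ValueError('end must be later than start')
    return f"{BASE_URL}/{to_api_timestamp(start)}/{to_api_timestamp(end)}"


start = datetime(2018, 12, 30, 12, 35, tzinfo=timezone.utc)
end = datetime(2019, 1, 3, 12, 35, tzinfo=timezone.utc)
url = build_intensity_url(start, end)
print(url)
```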

<b>To load the JSON data, insert a new cell and add the following code</b>

In [162]:
data = json.loads(rrq.data)
pprint(data)

{'data': [{'from': '2018-12-30T12:30Z',
           'intensity': {'actual': 203, 'forecast': 202, 'index': 'moderate'},
           'to': '2018-12-30T13:00Z'},
          {'from': '2018-12-30T13:00Z',
           'intensity': {'actual': 208, 'forecast': 201, 'index': 'moderate'},
           'to': '2018-12-30T13:30Z'},
          {'from': '2018-12-30T13:30Z',
           'intensity': {'actual': 217, 'forecast': 205, 'index': 'moderate'},
           'to': '2018-12-30T14:00Z'},
          {'from': '2018-12-30T14:00Z',
           'intensity': {'actual': 225, 'forecast': 214, 'index': 'moderate'},
           'to': '2018-12-30T14:30Z'},
          {'from': '2018-12-30T14:30Z',
           'intensity': {'actual': 235, 'forecast': 220, 'index': 'moderate'},
           'to': '2018-12-30T15:00Z'},
          {'from': '2018-12-30T15:00Z',
           'intensity': {'actual': 247, 'forecast': 231, 'index': 'moderate'},
           'to': '2018-12-30T15:30Z'},
          {'from': '2018-12-30T15:30Z',
           '

           'intensity': {'actual': 407, 'forecast': 406, 'index': 'very high'},
           'to': '2019-01-03T11:00Z'},
          {'from': '2019-01-03T11:00Z',
           'intensity': {'actual': 407, 'forecast': 404, 'index': 'very high'},
           'to': '2019-01-03T11:30Z'},
          {'from': '2019-01-03T11:30Z',
           'intensity': {'actual': 404, 'forecast': 404, 'index': 'very high'},
           'to': '2019-01-03T12:00Z'},
          {'from': '2019-01-03T12:00Z',
           'intensity': {'actual': 397, 'forecast': 404, 'index': 'very high'},
           'to': '2019-01-03T12:30Z'}]}


<b>To create the DataFrame of the fetched data and print it, add the following code</b>

In [163]:
pd.DataFrame(data['data'])

Unnamed: 0,from,to,intensity
0,2018-12-30T12:30Z,2018-12-30T13:00Z,"{'forecast': 202, 'actual': 203, 'index': 'mod..."
1,2018-12-30T13:00Z,2018-12-30T13:30Z,"{'forecast': 201, 'actual': 208, 'index': 'mod..."
2,2018-12-30T13:30Z,2018-12-30T14:00Z,"{'forecast': 205, 'actual': 217, 'index': 'mod..."
3,2018-12-30T14:00Z,2018-12-30T14:30Z,"{'forecast': 214, 'actual': 225, 'index': 'mod..."
4,2018-12-30T14:30Z,2018-12-30T15:00Z,"{'forecast': 220, 'actual': 235, 'index': 'mod..."
...,...,...,...
187,2019-01-03T10:00Z,2019-01-03T10:30Z,"{'forecast': 403, 'actual': 407, 'index': 'ver..."
188,2019-01-03T10:30Z,2019-01-03T11:00Z,"{'forecast': 406, 'actual': 407, 'index': 'ver..."
189,2019-01-03T11:00Z,2019-01-03T11:30Z,"{'forecast': 404, 'actual': 407, 'index': 'ver..."
190,2019-01-03T11:30Z,2019-01-03T12:00Z,"{'forecast': 404, 'actual': 404, 'index': 'ver..."
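
In the DataFrame above, the <code>intensity</code> column still holds a dictionary in every row. One way to get a separate column per field is to flatten each record before building the DataFrame. The sketch below does this in plain Python on two records copied from the output above; passing the resulting <code>flat</code> list to <code>pd.DataFrame</code> should then give <code>actual</code>, <code>forecast</code>, and <code>index</code> columns.

```python
# Two records copied from the API output shown earlier.
records = [
    {'from': '2018-12-30T12:30Z', 'to': '2018-12-30T13:00Z',
     'intensity': {'actual': 203, 'forecast': 202, 'index': 'moderate'}},
    {'from': '2018-12-30T13:00Z', 'to': '2018-12-30T13:30Z',
     'intensity': {'actual': 208, 'forecast': 201, 'index': 'moderate'}},
]

# Merge the nested 'intensity' dictionary into the top level of each record.
flat = [{'from': r['from'], 'to': r['to'], **r['intensity']} for r in records]

# With flat records, simple aggregates become one-liners.
mean_actual = sum(row['actual'] for row in flat) / len(flat)

print(flat[0])
print(mean_actual)
```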


<h2>Extracting Data from Local Files</h2>
<br>
In this exercise, we will extract data from different local files, such as a PDF file, an image file, an Excel file, and a Word file. Follow these steps to implement this exercise.
<br>
<br>
<b>Import the <code>textract</code> library to extract text from local files, along with <code>pandas</code> to read the Excel file</b>

In [7]:
import textract
# textract does not work well with some files, such as pdf and png
import pandas as pd

In [9]:
data = pd.read_excel('data/sample_excel.xlsx')
data.head()

Unnamed: 0,name,qualification,additional qualification,other qualification
0,Gangaram,B.Tech,,
1,Ganga,B.A.,,
2,Ram,B.Tech,M.Tech,
3,Ramlal,B.Music,,Diploma in Music


In [4]:
textract.process("data/sample_word_document.docx")

b'Hamlet said to Horatio, There are more things in heaven and earth, Horatio, Than are dreamt of in your philosophy.'

<h2>Performing Various Operations on Local Files</h2>
<br>
In this exercise, we will perform various file operations, such as open, write, read, append, and close, on local files. Follow these steps to implement this exercise.

<b>First, we create a text file and write a little content in it. Add the following code to implement this</b>

In [11]:
fp = open('data/sample_text.txt', 'w') 
fp.write("I'm in love text mining\n")
fp.close()

<b>To add more text into an existing text file, add the following code</b>

In [12]:
fp = open('data/sample_text.txt', 'a')
fp.write("I am learning Natural Language Processing\n")
fp.close()

<b>To read the content from the text file, add the following code</b>

In [13]:
fp = open('data/sample_text.txt', 'r')
lines = fp.readlines()
fp.close()  # close this handle before fp is reused for another file below
lines

["I'm in love text mining\n", 'I am learning Natural Language Processing\n']

<b>To open text files with various encodings, add the following code</b>

In [14]:
import nltk
nltk.download('unicode_samples')
file_location = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
fp = open(file_location,'r', encoding='latin2')
fp.readlines()

[nltk_data] Downloading package unicode_samples to
[nltk_data]     C:\Users\kleye\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\unicode_samples.zip.


['Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą\n',
 '"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez\n',
 'Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały\n',
 'odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki\n',
 'Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych\n',
 'archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.\n']

<b>To read these files line by line, insert a new cell and add the following code</b>

In [16]:
for line in open(file_location,'r', encoding='latin2'):
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą

"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez

Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały

odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki

Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych

archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
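
The blank lines in this output appear because each line read from the file already ends with a newline and <code>print</code> adds another. The sketch below writes two short Polish strings (taken from the text above) to a temporary file in Latin-2, reads them back with the trailing newline stripped, and checks that the same bytes cannot be decoded as UTF-8, which is why the <code>encoding</code> argument matters. The temporary path and variable names are illustrative.

```python
import os
import tempfile

lines_in = ['Pruska Biblioteka Państwowa.\n', 'Dolny Śląsk\n']
path = os.path.join(tempfile.mkdtemp(), 'polish-lat2.txt')

# Write the text using the Latin-2 (ISO-8859-2) encoding.
with open(path, 'w', encoding='latin2') as fp:
    fp.writelines(lines_in)

# Read it back line by line, stripping the newline to avoid blank lines.
with open(path, 'r', encoding='latin2') as fp:
    lines_out = [line.rstrip('\n') for line in fp]

# The raw bytes are not valid UTF-8, so the wrong codec fails loudly.
with open(path, 'rb') as fp:
    raw = fp.read()
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(lines_out)
print(utf8_ok)
```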



<b>To close the opened file, add the following code</b>

In [17]:
fp.close()
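
Calling <code>close</code> by hand is easy to forget, especially when the same variable name is reused for several files. A <code>with</code> block closes the file automatically when the block ends, even if an error occurs inside it. The sketch below repeats the write, append, and read steps of this exercise with that pattern; it writes to a temporary directory rather than <code>data/</code> so that it can run anywhere.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'sample_text.txt')

# 'w' creates the file (or truncates an existing one).
with open(path, 'w', encoding='utf-8') as fp:
    fp.write("I'm in love text mining\n")

# 'a' appends to the end of the existing file.
with open(path, 'a', encoding='utf-8') as fp:
    fp.write("I am learning Natural Language Processing\n")

# 'r' opens the file for reading.
with open(path, 'r', encoding='utf-8') as fp:
    lines = fp.readlines()

# The file object is already closed once the with block has ended.
closed_after_block = fp.closed

print(lines)
print(closed_after_block)
```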