# Text Collection and Preprocessing
You should collect and preprocess some textual data to investigate what `GISMA` is and how people perceive this brand on social media. In particular, you should do the following:
- Request and receive the [About GISMA](https://www.gisma.com/school/about-us) web page using [Requests](https://docs.python-requests.org/en/latest/user/quickstart/).
- Extract and clean up the main content of this web page using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
- Create a free [Twitter developer account](http://apps.twitter.com/) to get access to the Twitter API.
- Build a client application to work with the Twitter API using [Tweepy](https://docs.tweepy.org/en/stable/).
- Search for the latest tweets about `GISMA` and clean up their content.
- Considering the textual data collected from these two sources (GISMA's website and Twitter), what can you say about this brand?
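The tweet clean-up step in the list above can be sketched without any API access. This is a minimal sketch under stated assumptions: the helper name `clean_tweet`, the choice of what to strip (retweet prefix, URLs, mentions, hash signs) and the sample tweet are all illustrative, not taken from the assignment.

```python
import re

# Hypothetical helper: the name and the exact cleaning rules are assumptions.
RETWEET_RE = re.compile(r"^RT\s+@\w+:\s*")  # leading "RT @user: " prefix
URL_RE = re.compile(r"https?://\S+")        # http/https links
MENTION_RE = re.compile(r"@\w+")            # @username mentions


def clean_tweet(text: str) -> str:
    """Strip the retweet prefix, URLs, mentions and hash signs, then collapse whitespace."""
    text = RETWEET_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample input, for illustration only.
sample = "RT @someone: Studying at #GISMA https://t.co/abc123"
cleaned = clean_tweet(sample)
```

Keeping the hashtag word while dropping the `#` is a design choice: a word such as `GISMA` is usually the most informative token in a brand search.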

In [1]:

import sys
!{sys.executable} -m pip install textblob

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import requests
from bs4 import BeautifulSoup
import re
import nltk

import tweepy
from tweepy import OAuthHandler
from textblob import TextBlob



In [3]:
url = 'https://www.gisma.com/school/about-us'

In [4]:
page = requests.get(url)

In [5]:
page

<Response [200]>

In [6]:
page.status_code

200
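The status check above is done by eye. A small wrapper can make it explicit, assuming you want the notebook to stop with an error on a 4xx/5xx response rather than parse an error page. The function name `fetch_html` and the 10-second default timeout are assumptions; `requests.get(..., timeout=...)` and `Response.raise_for_status()` are part of the Requests API.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the response body as text.

    Raises requests.HTTPError for 4xx/5xx responses, and a
    requests.exceptions.Timeout if the server does not answer in time.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # no-op for a 200 response
    return response.text
```

Calling `fetch_html(url)` with the `url` defined earlier would then return the same string as `page.text`, provided the request succeeds.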

In [7]:
page.text

'\n<!DOCTYPE html>\n<html lang=en>\n<head>\n<title>About Us - German International School of Management and Administration (GISMA)</title>\n<!-- Google Tag Manager -->\n<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);})(window,document,\'script\',\'dataLayer\',\'GTM-5QDPQC\');</script>\n<!---- Meta tags -->\n<meta http-equiv=content-type content="text/html; charset=utf-8"/>\n<meta http-equiv=X-UA-Compatible content="IE=edge"/>\n<meta name=keywords content="School of Administration GISMA Business School"/>\n<meta name=description content="About Us - GISMA Business School, which is located in Germany. GISMA offer Bachelors and Masters Degree Courses in Germany with focus on helping students become exceptional leaders in their own professi

In [8]:
soup = BeautifulSoup(page.text, 'html.parser')

In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   About Us - German International School of Management and Administration (GISMA)
  </title>
  <!-- Google Tag Manager -->
  <script>
   (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-5QDPQC');
  </script>
  <!---- Meta tags -->
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="School of Administration GISMA Business School" name="keywords"/>
  <meta content="About Us - GISMA Business School, which is located in Germany. GISMA offer Bachelors and Masters Degree Courses in Germany with focus on helping students become exceptional leaders in their own professions." nam

In [10]:
soup.find_all('p')

[<p>We use cookies to enhance your experience. By continuing to visit this site you agree to our use of cookies. Please read our Privacy and Cookies policy</p>,
 <p></p>,
 <p class="introduction-maintext">Our history, accreditations and partners are the strength of GISMA Business School</p>,
 <p><p><span class="GreenText">About GISMA Business School</span></p>
 <p><span>Since its foundation in 1999, GISMA Business School has paved the way for talented and qualified people to enter the international business world. Equipped with an interdisciplinary foundation and digital literacy, our graduates are able to pinpoint problem situations in companies of all sizes, start-ups or other organisations, and develop innovative solutions with commitment, motivation and creativity. With our goals in mind, we continue to expand and support students from all over the world to find their dream job and be successful.</span></p>
 <p><span>As a state-recognised university, GISMA Business School awards it

In [11]:
soup.find_all('p')[2].get_text()

'Our history, accreditations and partners are the strength of GISMA Business School'

In [12]:
soup.find_all(class_='chorus')

[]

In [13]:
soup.find_all(id='third')

[]
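The two empty lists above are what `find_all` returns when nothing matches: no element on this page has the class `chorus` or the id `third`. The `find_all('p')` output, however, does show a paragraph with the class `introduction-maintext`. The sketch below shows both cases on a small HTML fragment modelled on that output; the fragment and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# A small fragment modelled on the find_all('p') output shown earlier.
html = """
<p>We use cookies to enhance your experience.</p>
<p class="introduction-maintext">Our history, accreditations and partners are the strength of GISMA Business School</p>
"""
snippet = BeautifulSoup(html, "html.parser")

# A class that does not occur in the markup yields an empty result.
missing = snippet.find_all(class_="chorus")

# A class that does occur can be targeted directly instead of by list index.
intro = snippet.find("p", class_="introduction-maintext")
intro_text = intro.get_text(strip=True)
```

Selecting by class is less fragile than `soup.find_all('p')[2]`, which breaks as soon as the page gains or loses a paragraph before the one you want.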

In [17]:

# Concatenate the text of every <p> element
text = ""
for paragraph in soup.find_all('p'):
    text += paragraph.text

# Remove bracketed numeric references and collapse whitespace
text = re.sub(r'\[[0-9]*\]', ' ', text)
text = re.sub(r'\s+', ' ', text)

# Lowercase, then replace non-word characters and digits with spaces
clean_text = text.lower()
clean_text = re.sub(r'\W', ' ', clean_text)
clean_text = re.sub(r'\d', ' ', clean_text)
clean_text = re.sub(r'\s+', ' ', clean_text)

# Tokenize sentences (requires the NLTK 'punkt' resource)
sentences = nltk.sent_tokenize(text)

# Stopword list (requires the NLTK 'stopwords' corpus)
stop_words = nltk.corpus.stopwords.words('english')

# Count non-stopword tokens
word2count = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        word2count[word] = word2count.get(word, 0) + 1

# Normalize counts to weights in (0, 1] by dividing by the largest count
max_count = max(word2count.values())
for key in word2count:
    word2count[key] = word2count[key] / max_count

In [19]:
print(sentences)


['We use cookies to enhance your experience.', 'By continuing to visit this site you agree to our use of cookies.', 'Please read our Privacy and Cookies policyOur history, accreditations and partners are the strength of GISMA Business SchoolAbout GISMA Business School Since its foundation in 1999, GISMA Business School has paved the way for talented and qualified people to enter the international business world.', 'Equipped with an interdisciplinary foundation and digital literacy, our graduates are able to pinpoint problem situations in companies of all sizes, start-ups or other organisations, and develop innovative solutions with commitment, motivation and creativity.', 'With our goals in mind, we continue to expand and support students from all over the world to find their dream job and be successful.', "As a state-recognised university, GISMA Business School awards its own Bachelor's and Master's degrees.", 'In addition, we enjoy the trust of some of the best universities in Europe
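The third sentence in this output contains fused words such as `policyOur history` and `SchoolAbout`. They arise because the paragraph texts are concatenated with no separator between them. A minimal sketch of the difference, using a two-paragraph fragment taken from the page text (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = (
    "<p>Please read our Privacy and Cookies policy</p>"
    "<p>Our history, accreditations and partners are the strength of GISMA Business School</p>"
)
snippet = BeautifulSoup(html, "html.parser")
paragraphs = snippet.find_all("p")

# Concatenating without a separator fuses the last word of one
# paragraph with the first word of the next.
fused = "".join(p.get_text() for p in paragraphs)

# Joining with a space keeps the word boundary intact.
spaced = " ".join(p.get_text(strip=True) for p in paragraphs)
```

A space restores the word boundary but not a sentence boundary. Paragraphs that lack closing punctuation, like the first one here, would still be merged into the following sentence by a sentence tokenizer.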

In [20]:
print(clean_text)

we use cookies to enhance your experience by continuing to visit this site you agree to our use of cookies please read our privacy and cookies policyour history accreditations and partners are the strength of gisma business schoolabout gisma business school since its foundation in gisma business school has paved the way for talented and qualified people to enter the international business world equipped with an interdisciplinary foundation and digital literacy our graduates are able to pinpoint problem situations in companies of all sizes start ups or other organisations and develop innovative solutions with commitment motivation and creativity with our goals in mind we continue to expand and support students from all over the world to find their dream job and be successful as a state recognised university gisma business school awards its own bachelor s and master s degrees in addition we enjoy the trust of some of the best universities in europe to offer their degree programmes throug
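The counting and weighting step applied to `clean_text` can also be written with `collections.Counter`. This is a simplified sketch, not the notebook's method: it swaps `nltk.word_tokenize` for a plain regular expression and uses a tiny hand-written stopword set instead of NLTK's English list, so the resulting weights would differ from `word2count`. The function name `word_weights` is an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "and", "of", "to", "our", "we", "a", "in", "is"}


def word_weights(clean_text: str, stop_words=STOP_WORDS) -> dict:
    """Map each non-stopword to its count divided by the largest count."""
    tokens = re.findall(r"[a-z]+", clean_text)  # regex stand-in for nltk.word_tokenize
    counts = Counter(t for t in tokens if t not in stop_words)
    if not counts:
        return {}  # avoid max() on an empty collection
    max_count = max(counts.values())
    return {word: n / max_count for word, n in counts.items()}


weights = word_weights("gisma business school awards degrees and gisma business school")
```

Here `gisma`, `business` and `school` each occur twice and receive weight 1.0, while `awards` and `degrees` occur once and receive 0.5.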