
# INFO 103: Introduction to data science <br> Demo \#2: Data Types <br> Author: JRW

## Data
Here's a list of the files that we will be working with in this notebook:

* Image: ./Rgb-raster-image_smile.jpg
* Text: ./MobyDick.txt
* Spreadsheet: ./APPL.csv
* html: ./index.html
* json: ./youtube.json
* xml: ./simple.xml

All of the paths start with a './' because they should be in the same directory as this notebook. Also, here are the sources, for reference:

* Image: https://en.wikipedia.org/wiki/Raster_graphics
* Text: http://www.gutenberg.org/ebooks/2701
* Spreadsheet: https://finance.yahoo.com/quote/AAPL/history?p=AAPL
* html: http://www.example.com
* json: https://www.sitepoint.com/10-example-json-files/
* xml: https://www.w3schools.com/xml/xml_examples.asp

### Modules

In [None]:
import numpy as np ## numpy provides a suite of numerical tools
import pandas as pd ## pandas provides nice ordered array (spreadsheet) handling
import re ## re provides regular expressions, which help with text
import json ## json provides json-format data management
import datetime ## datetime interprets dates as numerical quantities
from bs4 import BeautifulSoup ## Beautifulsoup parses html, thus, also xml!
from IPython.display import display ## display allows you to show an image here in the notebook
from PIL import Image ## Image allows you to edit image data
from io import BytesIO

### Spreadsheets
I'm pretty basic when it comes to programming, so I usually just use base Python with for loops and regular expressions, but for convenience we'll use pandas, which is great for manipulating spreadsheets as 2-d arrays. Here are the parameters we use:

* filepath_or_buffer: specifies the name of the file
* sep: specifies the column separator, here, commas
* header: specifies where the header is, here, the first row (row index 0)
* parse_dates: specifies columns to be treated as timestamp objects

The resulting pandas object is a "dataframe".

In [None]:
APPL = pd.read_csv(filepath_or_buffer="APPL.csv", sep=",", header=0, parse_dates = [0])
print(APPL.keys())
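
As a minimal sketch of how these parameters behave, the cell below reads a tiny in-memory CSV instead of the real file. The column names mirror the ones used in this notebook, but the numbers are made up for illustration, and io.StringIO is only a stand-in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for a stock-history file; the values are invented.
csv_text = """Date,Open,Close,Volume
2020-01-03,10.0,10.5,1000
2020-01-02,9.5,10.0,1200
2020-01-01,9.0,9.4,900
"""

prices = pd.read_csv(
    filepath_or_buffer=io.StringIO(csv_text),  # a buffer works like a filename
    sep=",",            # columns are separated by commas
    header=0,           # the first row (index 0) holds the column names
    parse_dates=[0],    # treat column 0 ("Date") as timestamps
)

print(prices.dtypes)                       # one inferred type per column
print(prices["Close"] - prices["Open"])    # element-wise subtraction of columns
```

The inferred types are what make the later arithmetic on floats, ints and timestamps possible without any manual casting.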

You can look at the first few rows with head:

In [None]:
APPL.head(10)

and the columns by dictionary keys:

In [None]:
print(APPL["Date"][1:10])

Notice the "Open" column has floats, while the "Volume" column has ints:

In [None]:
print(type(APPL["Open"][0]))
print(type(APPL["Volume"][0]))

In [None]:
APPL["Open"][0]

The weird column is Date, which has timestamp objects:

In [None]:
print(type(APPL["Date"][0]))

You can do the usual math with floats and ints:

In [None]:
## scalar subtraction of two individual values
print(APPL["Close"][0] - APPL["Open"][1])
print("")
## element-wise (vectorized) subtraction of two column slices
print(APPL["Close"][0:10] - APPL["Open"][0:10])

#### So this is actually a type of timeseries data
There are several numeric data columns, with the dates as the time stamps. If you wanted to plot any of the columns as a timeseries, you would have to be sure the intervals are correct! This can be done automatically, or you can subtract adjacent values to determine intervals. Here's an example subtracting two times to get the difference in days:

In [None]:
print(APPL["Date"][0])
print(APPL["Date"][9])
print("")
print("Day 0 and day 9 are actually "+str((APPL["Date"][0] - APPL["Date"][9]).days)+" days apart!")
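
The idea of subtracting adjacent values to determine intervals can be sketched with base Python dates. The four dates below are a hypothetical run of trading days (not taken from the spreadsheet), chosen so that a weekend falls in the middle.

```python
from datetime import date

# Hypothetical trading days: Thursday, Friday, then Monday and Tuesday.
trading_days = [
    date(2020, 1, 2),
    date(2020, 1, 3),
    date(2020, 1, 6),
    date(2020, 1, 7),
]

# Subtracting two dates yields a timedelta; .days gives the gap in whole days.
intervals = [
    (later - earlier).days
    for earlier, later in zip(trading_days, trading_days[1:])
]
print(intervals)  # [1, 3, 1]
```

Consecutive rows are not always one day apart, which is why the intervals need checking before plotting.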

### Text
The file I've got for you here is Herman Melville's Moby Dick. This is kind of like the E. coli/lab rat of the text analysis world. I'm going to go super basic and load this text file in one chunk as a string.

In [None]:
with open("MobyDick.txt", "r",encoding='utf-8') as f:
    MobyDick = f.read()

Strings behave a lot like lists of characters, so we can look at the beginning of the text by indexing and slicing:

In [None]:
MobyDick[6]

In [None]:
print(MobyDick[:10000])
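
Here is a small sketch of the string-as-sequence idea on a short string (the novel's opening sentence), so the results are easy to check by eye. It also shows the one big difference from lists: strings cannot be changed in place.

```python
text = "Call me Ishmael."

print(text[0])    # a single character, like one list element
print(text[:4])   # a slice, like a sub-list
print(len(text))  # number of characters

# Unlike a list, a string is immutable: item assignment raises a TypeError.
try:
    text[0] = "c"
    assignment_failed = False
except TypeError:
    assignment_failed = True
print(assignment_failed)
```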

#### Text is 'messy'
As you can see, there is lots of leader text surrounding the main content, so a main focus with text processing is figuring out how to get what we want. This is where regular expressions come in handy. For the sake of simplicity, let's just grab the index.

In [None]:
index = {}
for chapter in re.finditer(r"(CHAPTER\s+\d+)\.\s+([^\n\r]+)", MobyDick):
    ChapNum = chapter.group(1)
    if not index.get(ChapNum, False):
        index[ChapNum] = {
            "title": chapter.group(2),
            "text": ""
        }
print("Total chapters: "+str(len(index.keys())))
print("")
print("This is what we've got for chapter 2:\n")
print(index["CHAPTER 2"])
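
To see what the two capture groups pick up, here is a minimal sketch that runs the same pattern on a few hand-written lines. The sample string is an assumption about how the file is laid out (a table of contents followed by the chapters), not an excerpt of the real file.

```python
import re

# Hypothetical miniature of the file: a two-line contents list, then chapter 1.
sample = (
    "CHAPTER 1. Loomings.\n"
    "CHAPTER 2. The Carpet-Bag.\n"
    "\n"
    "CHAPTER 1. Loomings.\n"
    "\n"
    "Call me Ishmael.\n"
)

pattern = re.compile(r"(CHAPTER\s+\d+)\.\s+([^\n\r]+)")

toc = {}
for match in pattern.finditer(sample):
    # group(1) is the chapter label, group(2) is the rest of that line.
    # setdefault keeps only the first time each label is seen.
    toc.setdefault(match.group(1), match.group(2))

print(toc)
```

Keeping only the first occurrence plays the same role as the index.get check in the cell above.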

Grabbing the index was pretty easy, because it was "semi-structured", but other forms of text and tasks are really not this nice! For example, now that we have the chapter names, we may want to find the individual chapters and separate them. How could we do this? (Note: The last chapter's text will include any text at the bottom of the document, requiring further preprocessing.)

In [None]:
ByChapter = re.split(r"\n(CHAPTER \d+)\. ", MobyDick)

In [None]:
print(ByChapter[3])

In [None]:
ByChapter = re.split(r"\n(CHAPTER \d+)\. ", MobyDick)
print("Total splits: "+str(len(ByChapter)))
TotalChapterTextsFound = 0
for i in range(135):
    if index.get(ByChapter[2*i+1],False):
        TotalChapterTextsFound += 1
        index[ByChapter[2*i+1]]["text"] = ByChapter[2*(i+1)]
#         print("Found the text for "+ByChapter[2*i+1]+": "+index[ByChapter[2*i+1]]["title"])
print("\nFound "+str(TotalChapterTextsFound)+" chapter texts in total!")
print("Here's what we got for CHAPTER 2: \n\n"+index["CHAPTER 2"]["text"])
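
The odd/even indexing above works because re.split keeps the captured chapter labels in its output. Here is a minimal sketch on a made-up miniature text (the second chapter's sentence is invented filler), pairing labels with texts by slicing instead of a fixed loop count.

```python
import re

# Hypothetical miniature document: some front matter, then two chapters.
sample = (
    "front matter\n"
    "CHAPTER 1. Loomings.\nCall me Ishmael.\n"
    "CHAPTER 2. The Carpet-Bag.\nSome text for chapter two.\n"
)

# Because the pattern has a capture group, the labels stay in the result:
# [before, label, text, label, text, ...]
parts = re.split(r"\n(CHAPTER \d+)\. ", sample)
print(parts)

# Labels sit at odd positions and their texts at the following even positions.
chapters = dict(zip(parts[1::2], parts[2::2]))
print(chapters["CHAPTER 1"])
```

Slicing like this adapts to however many splits were found, which is one way to avoid hard-coding the number of chapters.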

#### Pre-processing also includes feature selection
A primary feature selection task is called "tokenization." A basic way to get tokens is to split by spaces. Counting these "words" up creates the "bag-of-words" framework for text analysis.

In [None]:
for chapter in index:
    tokens = re.split(" ", index[chapter]["text"])
    index[chapter]["counts"] = {}
    for token in tokens:
        index[chapter]["counts"].setdefault(token, 0)
        index[chapter]["counts"][token] += 1

words = sorted(index["CHAPTER 2"]["counts"], key=lambda x: index["CHAPTER 2"]["counts"][x], reverse = True)
print("Here are the top 100 words in CHAPTER 2:\n\n")

print("\n".join([": ".join(x) for x in zip(map(str,range(1,101)),words[0:100])]))
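
The same counting can be sketched more compactly with collections.Counter from the standard library. This is an alternative to the setdefault loop above, not what the notebook itself uses, and the sentence is a made-up example.

```python
from collections import Counter

# Made-up sentence, just to have a few repeated words.
text = "the whale and the sea and the ship"

tokens = text.split(" ")   # split on single spaces, as in the loop above
counts = Counter(tokens)   # maps each token to how often it appears

print(counts.most_common(2))  # [('the', 3), ('and', 2)]
```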

### Images
Image data is unstructured like text, but presents quite different challenges, as it does not generally represent a language. Let's start by loading our image in and displaying it.

In [None]:
with open('Rgb-raster-image_smile.jpg', 'rb') as inf:
    jpgdata = inf.read()
image_object = Image.open(BytesIO(jpgdata))

In [None]:
print(image_object)
display(image_object)

#### What does image data actually look like?
Recall, jpgs are just matrices of pixels, which record the intensities of three colors: Red, Green, and Blue. This is encoded with a python image object as a three dimensional array, i.e., a matrix of triplicate numbers. This image is of size 368 x 400, so we can access pixels in x and y dimensions with indices up to (but not including) those numbers! (Note: Opening the image is lazy, so the pixels are not actually retrieved unless they are loaded.)

In [None]:
px = image_object.load()
print(px[0,0])
print(px[183,199])
print(px[183,100])

To change a pixel to black, all we have to do is zero-out each of the bytes. To change a pixel to white, all we have to do is set each byte to 255 (recall: a single byte can take 256 values, 0 through 255). We can do these operations for a bunch of pixels by double looping. (Note: When we alter the pixels, the changes take effect in the image object.)

In [None]:
for i in range(100):
    for j in range(100):
        px[i,j] = (0,0,0)

for i in range(200,301):
    for j in range(200,301):
        px[i,j] = (255,255,255)
display(image_object)
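
To make the "matrix of triplets" picture concrete without needing the image file, here is a sketch that uses plain nested lists in place of a PIL image. The 4 x 3 size and the gray background are arbitrary choices for illustration. Note that the nested list is indexed as pixels[y][x] (row first), whereas the px object above is indexed as px[x, y].

```python
WIDTH, HEIGHT = 4, 3
GRAY = (128, 128, 128)

# A tiny "image": HEIGHT rows, each holding WIDTH (R, G, B) triplets.
pixels = [[GRAY for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Blacken the top-left 2 x 2 block by zeroing all three bytes.
for x in range(2):
    for y in range(2):
        pixels[y][x] = (0, 0, 0)

# Whiten the bottom-right pixel by setting all three bytes to 255.
pixels[HEIGHT - 1][WIDTH - 1] = (255, 255, 255)

for row in pixels:
    print(row)
```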

In [None]:
for i in range(268,368):
    for j in range(300,400):
        px[i,j] = (0,0,0)
display(image_object)

All you have to do is change the RGB intensities to make a variety of colors.

In [None]:
for i in range(0,368):
    for j in range(200,301):
        px[i,j] = ((i + j) % 256,i % 256,j % 256)
display(image_object)

#### What about features here?
Features in images take on physical meanings, perhaps related to how we recognize things. So, for faces this might be the distances between eyes. However this means figuring out where the eyes are first.

In [None]:
for i in range(166,206):
    for j in range(101,141):
        px[i,j] = (0,0,0)

for i in range(246,286):
    for j in range(101,141):
        px[i,j] = (0,0,0)
display(image_object)

So, it looks like the eyes here are about 80 pixels apart. The measurement of this feature might discern our smile from other smiles whose eyes are wider apart. For fun, let's reload the data (since we kind of wrecked it) and give our smile some crazy eyes!
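
The 80-pixel figure can be checked from the two boxes we just blacked out. This sketch takes the horizontal ranges from the loops above and measures the distance between their centers; the helper name center is just for illustration.

```python
# Horizontal extents of the two blacked-out boxes, copied from the loops above.
left_eye_x = range(166, 206)
right_eye_x = range(246, 286)


def center(span):
    """Midpoint of the first and last index covered by a range."""
    return (span.start + span.stop - 1) / 2


eye_distance = center(right_eye_x) - center(left_eye_x)
print(eye_distance)  # 80.0
```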

In [None]:
with open('Rgb-raster-image_smile.jpg', 'rb') as inf:
    jpgdata = inf.read()
image_object = Image.open(BytesIO(jpgdata))
px = image_object.load()

for i in range(166,206):
    for j in range(101,141):
        px[i,j] = ((i * j) % 256,(i + j) % 256,(i - j) % 256)

for i in range(246,286):
    for j in range(101,141):
        px[i,j] = ((i * j) % 256,(i + j) % 256,(i - j) % 256)
display(image_object)

### XML
The Extensible Markup Language (XML) format is a verbose type of associative data. It uses nested tags, <\key> ...value... </\key>, to associate values to keys. Here, we've got a restaurant menu in XML format.

In [None]:
with open("simple.xml", "r") as f:
    menu = f.read()
print(menu)

To get the keys and values out of the XML we will need to parse it. We could spend a long time cooking up our own regular expressions for this, but people have already done this quite well, and packed it into the BeautifulSoup module.

In [None]:
soup = BeautifulSoup(menu, 'lxml')
print(soup.find('food'))

It looks like the schema has each food with a name, price, description and calories. Let's code this into a nice python dict object so we could navigate it more easily.

In [None]:
ourmenu = {}
for food in soup.find_all('food'):
    ourmenu[food.find("name").text] = {
        "name": food.find("name").text,
        "price": float(re.sub(r"\$","",food.find("price").text)),
        "description": food.find("description").text,
        "calories": int(food.find("calories").text)
    }
print("Here's what we've got now for Belgian Waffles:\n\n")
print(ourmenu["Belgian Waffles"])
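
For comparison, here is a sketch of the same XML-to-dict conversion using xml.etree.ElementTree from the standard library instead of BeautifulSoup. The one-item menu below is a hypothetical example that follows the name/price/description/calories schema, not a copy of the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-item menu following the same schema as the file above.
menu_xml = """<breakfast_menu>
  <food>
    <name>Plain Toast</name>
    <price>$2.50</price>
    <description>Two slices of toast with butter</description>
    <calories>200</calories>
  </food>
</breakfast_menu>"""

root = ET.fromstring(menu_xml)

toast_menu = {}
for food in root.iter("food"):
    name = food.findtext("name")
    toast_menu[name] = {
        "name": name,
        # XML stores everything as text, so the casts are up to us.
        "price": float(food.findtext("price").lstrip("$")),
        "description": food.findtext("description"),
        "calories": int(food.findtext("calories")),
    }

print(toast_menu["Plain Toast"])
```

Either parser hands back strings, so the float and int conversions are needed in both versions.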

Since we've converted the prices and calories to floats and ints, respectively, we can now measure some totals for a given meal. For example, let's order a Homestyle Breakfast with a side of French Toast and check the damage.

In [None]:
price = 0.
calories = 0
for name in ["Homestyle Breakfast","French Toast"]:
    print("Name: "+name)
    print("Calories: "+str(ourmenu[name]["calories"]))
    print("Price: $"+str(ourmenu[name]["price"]))
    print("")
    price += ourmenu[name]["price"]
    calories += ourmenu[name]["calories"]
print("Total calories: "+str(calories))
print("Total price: $"+str(price))

### HTML
HTML is closely related to XML: it uses the same kind of nested tags, with a specific vocabulary for encoding webpages. So, we can use BeautifulSoup just the same here to get text content, hyperlinks, etc.

In [None]:
with open("index.html", "r") as f:
    webpage = f.read()
print(webpage)

Usually, content text lives in <\p> tags in the <\body>, and hyperlinks live in <\a> tags.

In [None]:
websoup = BeautifulSoup(webpage, 'lxml')
print("Here are any body paragraphs:\n")
for body in websoup.find_all("body"):
    for paragraph in body.find_all("p"):
        print("Paragraph: "+paragraph.text)
        print("")
print("\n")
print("Here are any Hyperlink display texts and URLs:\n")
for hyperlink in websoup.find_all("a"):
    print("Hyperlink display text: "+hyperlink.text)
    print("Hyperlink URL: "+hyperlink["href"])
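
If BeautifulSoup is not available, the same link extraction can be sketched with html.parser from the standard library. This is a swapped-in technique, not the notebook's method; the LinkCollector class and the one-link page below are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (display text, URL) pairs for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # URL of the <a> tag we are currently inside
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical page with one paragraph of text and one hyperlink.
page = (
    "<html><body><p>Hello.</p>"
    '<p><a href="https://www.example.org/more">More information</a></p>'
    "</body></html>"
)

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```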

### JSON
JSON is probably the favorite associative array data type for most data scientists. In Python, it's very easy to load, and because of its similarity to the Python dict object type, it can generally be navigated as such. Here, our json file is an example from YouTube's API.

In [None]:
with open("youtube.json", "r") as f:
    video = json.loads(f.read())
print("The data is actually one level into the schema:\n")
print(video.keys())
print("")
print("Here are the keys for the data on this video:\n")
print(video["data"].keys())

Within a list, the individual items are a good place to start exploring. Let's look at the first (only) item. Note that the values that we are printing out must be converted to strings. This is actually awesome, because it means that the json format implicitly encodes the object types, even in a text serialization (unlike XML). This is a real advantage for convenience in working with data: you don't necessarily have to cast types!

In [None]:
for key in video["data"]["items"][0]:
    print(key+": "+str(video["data"]["items"][0][key]))

Creating your own json files is also quite easy because of the dict-json parallel. Let's take our converted restaurant menu and write out to a json file.

In [None]:
with open("simple.json", "w") as f:
    f.write(json.dumps(ourmenu))

The really cool thing about writing out the menu to json format (as we did) is that the json (unlike XML) format will carry forward the object types in the file. So, just as our YouTube data came in as ints and floats, the type changes that we made to the menu will be preserved in the output format!
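
A quick way to convince yourself of this is a round trip through a string rather than a file. The one-item menu below is a hypothetical stand-in for ourmenu.

```python
import json

# Hypothetical menu entry with a float price and an int calorie count.
small_menu = {
    "Plain Toast": {
        "price": 2.5,
        "calories": 200,
        "description": "Two slices of toast with butter",
    }
}

serialized = json.dumps(small_menu)   # dict -> json text
restored = json.loads(serialized)     # json text -> dict

print(serialized)
print(type(restored["Plain Toast"]["price"]))
print(type(restored["Plain Toast"]["calories"]))
```

The price comes back as a float and the calories as an int, with no casting on our side.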