## Lighthouse Labs
### W2D3 Other Data Types

Instructor: Mark Cassar
Original Notebook: Socorro Dominguez  

## Agenda
* Tabular data and tidy data 
* Other data types
    - JSON
    - XML
    - HTML
    - XLSX
    - Text
    - Images

## What is tabular data?
- each row is a single observation
- variables are in columns

## What is tidy data? 

- This concept stems from a paper written by renowned data scientist Hadley Wickham in 2014.

- We tidy our data so that it follows one standard layout that many analysis tools expect. This shifts the focus from working out how the data is structured to answering the actual analysis question.

Tidy data satisfies the following three criteria:

- Each row is a single observation,
- Each variable is a single column, and
- Each value is a single cell (i.e., its row, column position in the dataframe is not shared with another value)

## Tidy data

<img src="https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png" width="800" />

What counts as a variable and what counts as an observation may depend on your immediate goal.

## Example 1: Is this tidy? 
*Examples source: https://garrettgman.github.io/tidying/*


| country | year | cases_per_capita |
|---------|------|------------------|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |

## Example 2: Is this tidy?

| country | cases (year=1999) | cases (year=2000)| population (year=1999) | population (year=2000)|
|---------|-------|-------|-------|-------|
| Afghanistan |   745 |  2666 |  19987071 |  20595360 |
|  Brazil | 37737 | 80488 | 172006362 | 174504898 |
|  China | 212258 | 213766 | 1272915272 | 1280428583 |

## Example 3: Is this tidy?


| country | year  | cases | population |
|---------|-------|-------|------------|
| Afghanistan | 1999  |  745  | 19987071|
| Afghanistan | 2000 |  2666 |  20595360|
|Brazil |1999 | 37737  |172006362|
| Brazil| 2000 | 80488 | 174504898|
| China | 1999 | 212258 |1272915272|
|  China |2000 | 213766 | 1280428583|

## Example 4: Is this tidy?


| country | year | key | value |
|---------|------|-----|-------|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |

**Example 3** is the tidy one, and it is much easier to work with than the others.

To work with the other layouts you need to take extra steps, which makes your code harder to write and the concepts harder to follow.

The energy you need to manage a poor layout will increase with the size of your data. 

Avoid these difficulties by converting your data into a tidy format at the start of your analysis.
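
As a minimal sketch of such a conversion, the cell below rebuilds the first two countries of Example 4 by hand and reshapes them into the Example 3 layout with `DataFrame.pivot`. The variable names `long_df` and `tidy` are only illustrative, and passing a list to `index` assumes a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

# Example 4 layout: one row per (country, year, key). First two countries only.
long_df = pd.DataFrame({
    "country": ["Afghanistan"] * 4 + ["Brazil"] * 4,
    "year": [1999, 1999, 2000, 2000] * 2,
    "key": ["cases", "population"] * 4,
    "value": [745, 19987071, 2666, 20595360,
              37737, 172006362, 80488, 174504898],
})

# Spread the key/value pairs into one column per variable
tidy = (
    long_df.pivot(index=["country", "year"], columns="key", values="value")
    .reset_index()
)
tidy.columns.name = None  # drop the leftover "key" label on the column axis

print(tidy)
```

Each row of `tidy` is now one country-year observation, with `cases` and `population` as separate columns.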

## What other types of data are useful to data scientists?


| Data Type | Example | What are we trying to use it for? |
|:-|:---------------------|:--------------------------------------------|
| Text         | Tweets, scripts, books | Sentiment analysis, other NLP |
| JSON or XML  | Parsing APIs | Gather data, data ingestion process, trend analysis, forecasting |
| HTML         |Web scraping| Get prices of different products, Facebook / LinkedIn contacts and work history|
| Images       |Computer vision|Self-driving cars, X-rays - diagnostics|

- Text: will be covered in more detail when we do Natural Language Processing (NLP).
- JSON: we will talk about nested JSON files today.
- XML and HTML: we saw the basics earlier and will talk more about them today.
- Images: will be covered in more detail when we do Convolutional Neural Networks (CNNs).

## Different data, different tools

* Tabular data: `pandas`, SQL
* XML and HTML: `xml`
* JSON: `json`
* HTML (alternative): `BeautifulSoup`
* Images (CNNs): Keras, PyTorch, TensorFlow
* Sequences (LSTMs): Keras, PyTorch, TensorFlow
* Text: `nltk`, spaCy, Keras, PyTorch, TensorFlow


## Tabular data using pandas


In [2]:
!pip install openpyxl

Defaulting to user installation because normal site-packages is not writeable
Collecting openpyxl
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
     ------------------------------------ 242.1/242.1 kB 825.3 kB/s eta 0:00:00
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10


In [3]:
import pandas as pd
import openpyxl

In [5]:
# With no engine given, pandas picks one from the file extension (openpyxl for .xlsx)
wine = pd.read_excel('data/wine.xlsx')
wine

Unnamed: 0,Bottle,Grape,Origin,Alcohol,pH,Colour,Aroma
0,1,Chardonnay,Australia,14.23,3.51,White,Floral
1,2,Pinot Grigio,Italy,13.2,3.3,White,Fruity
2,3,Pinot Blanc,France,13.16,3.16,White,Citrus
3,4,Shiraz,Chile,14.91,3.39,Red,Berry
4,5,Malbec,Argentina,13.83,3.28,Red,Fruity


In [6]:
wine_xlsx = pd.read_excel('data/wine.xlsx', engine = 'openpyxl')
wine_xlsx

Unnamed: 0,Bottle,Grape,Origin,Alcohol,pH,Colour,Aroma
0,1,Chardonnay,Australia,14.23,3.51,White,Floral
1,2,Pinot Grigio,Italy,13.2,3.3,White,Fruity
2,3,Pinot Blanc,France,13.16,3.16,White,Citrus
3,4,Shiraz,Chile,14.91,3.39,Red,Berry
4,5,Malbec,Argentina,13.83,3.28,Red,Fruity


## Python and Excel
- They are complementary! It's not a competition.
- Uses for both
- For example:

In [7]:
wine_2 = pd.read_excel('data/wine_2.xlsx', engine = 'openpyxl')
wine_2

Unnamed: 0,This workbook contains data for different wines
0,Created by John Smith


- What's wrong? What happens when we open the file?


## Tab 1
![tab1](img/Tab1.png)

## Tab 2
![tab2](img/Tab2.png)

## Python and Excel ... continued
- After opening the file, we know that the useful data is on the second tab, the sheet named `DATA`

In [8]:
wine_2 = pd.read_excel('data/wine_2.xlsx', sheet_name='DATA', engine = 'openpyxl') 
wine_2

Unnamed: 0,Bottle,Grape,Origin,Alcohol,pH,Colour,Aroma
0,1,Chardonnay,Australia,14.23,3.51,White,Floral
1,2,Pinot Grigio,Italy,13.2,3.3,White,Fruity
2,3,Pinot Blanc,France,13.16,3.16,White,Citrus
3,4,Shiraz,Chile,14.91,3.39,Red,Berry
4,5,Malbec,Argentina,13.83,3.28,Red,Fruity
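
You can also inspect the sheets from Python instead of opening Excel. The sketch below builds a small two-sheet workbook in memory as a stand-in for `wine_2.xlsx` (the `README` sheet name and the sample rows are made up for illustration) and loads every sheet at once with `sheet_name=None`.

```python
import io

import pandas as pd

# Build a tiny two-sheet workbook in memory (hypothetical stand-in for wine_2.xlsx)
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"note": ["Created by John Smith"]}).to_excel(
        writer, sheet_name="README", index=False
    )
    pd.DataFrame({"Bottle": [1, 2], "Grape": ["Chardonnay", "Shiraz"]}).to_excel(
        writer, sheet_name="DATA", index=False
    )
buffer.seek(0)

# sheet_name=None returns a dict of DataFrames keyed by sheet name
sheets = pd.read_excel(buffer, sheet_name=None, engine="openpyxl")
sheet_names = list(sheets.keys())
print(sheet_names)

data = sheets["DATA"]
print(data)
```

With a real file, passing the path instead of `buffer` works the same way; listing the keys first tells you which `sheet_name` to ask for.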


## Data Science intro to XML and HTML
- a hierarchical collection of elements
- an element consists of a start tag, content, and an end tag; the start tag may carry attributes
- tags have names and are delimited with `<` and `>`; an end tag begins with `</`
- attributes are added after the tag name, inside the `<` and `>` of the start tag

For example: `<country name="Canada">` ... `</country>`
- `country` is the tag
- `name="Canada"` is an attribute
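
To make the vocabulary concrete, here is a small sketch that parses one element with the standard-library `xml.etree.ElementTree` module and pulls out its three parts. The element content (`North America`) is made up for illustration.

```python
import xml.etree.ElementTree as ET

# One element: start tag with an attribute, some content, and an end tag
element = ET.fromstring('<country name="Canada">North America</country>')

tag = element.tag        # the tag name
attrs = element.attrib   # attributes as a dict
text = element.text      # the content between the start and end tags

print(tag, attrs, text)
```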

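**HTML uses the same tag-and-attribute structure as XML, applied to web pages.** Strictly speaking it is a separate, more forgiving standard, but the same ideas carry over.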


## Let's see an HTML example

Let's take a look at an example: https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners
- Use your browser's Inspect tool to see the HTML (F12 on Windows)
- Explain header, body, tags etc.

In [9]:
from bs4 import BeautifulSoup
import requests

In [10]:
scrape_url = 'https://canada.ca/en/services/science.html'
page = requests.get(scrape_url)
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html class="no-js" dir="ltr" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head prefix="og: http://ogp.me/ns#">
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<title>Science and innovation - Canada.ca</title>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="http://purl.org/dc/terms/" rel="schema.dcterms"/>
<link href="https://www.canada.ca/en/services/science.html" rel="canonical"/>
<link href="https://www.canada.ca/en/services/science.html" hreflang="en" rel="alternate"/>
<link href="https://www.canada.ca/fr/services/science.html" hreflang="fr" rel="alternate"/>
<meta content="Learn about scientific research on health, the environment and space, and access programs and services that support business innovation. Includes how to fund, finance or partner ongoing research, protect intellectual property, and find Government of Canada scientists or datasets." name="description"/>
<meta content="Innovation, 

In [11]:
soup.find_all('p')

[<p>Scientific research, funding, datasets, innovation support, facilities and educational resources.</p>,
 <p>Find out what government can do for your business.</p>,
 <p>Find out how the Government of Canada is building Canada's biomanufacturing capacity</p>,
 <p>Funding and awards for scientific research, research infrastructure and research networks.</p>,
 <p>Government research and resources on a wide variety of scientific topics.</p>,
 <p>Government data, statistics, analyses and archival information to assist with research and discovery.</p>,
 <p>Government research centres across Canada and how to partner with or access these facilities.</p>,
 <p>Funding, collaboration, commercialization and licensing resources to help fuel innovation.</p>,
 <p>Protecting your intellectual property, trademarks, copyright and using <abbr title="Intellectual Property">IP</abbr> as a business tool.</p>,
 <p>A directory of government scientists and research professionals.</p>,
 <p>Activities, lesson

In [12]:
for item in soup.find_all('p'):
    print(item.get_text())
    print('\n')

Scientific research, funding, datasets, innovation support, facilities and educational resources.


Find out what government can do for your business.


Find out how the Government of Canada is building Canada's biomanufacturing capacity


Funding and awards for scientific research, research infrastructure and research networks.


Government research and resources on a wide variety of scientific topics.


Government data, statistics, analyses and archival information to assist with research and discovery.


Government research centres across Canada and how to partner with or access these facilities.


Funding, collaboration, commercialization and licensing resources to help fuel innovation.


Protecting your intellectual property, trademarks, copyright and using IP as a business tool.


A directory of government scientists and research professionals.


Activities, lesson plans, videos and more to help youth learn about science and technology.


Resources to take action to fight climate
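
The same calls work on any HTML string, which is handy for experimenting without hitting a live site. The snippet below is a sketch on a small made-up page (the markup, the `intro` class and the link are assumptions, not taken from canada.ca) showing `find_all`, `find` with a class filter, and attribute access.

```python
from bs4 import BeautifulSoup

# A small, made-up page so the example runs without network access
html = """
<html><body>
  <h1>Science and innovation</h1>
  <p class="intro">Scientific research and funding.</p>
  <p>Find a <a href="/en/services/science/data.html">dataset</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of every <p> tag
paragraphs = [p.get_text() for p in soup.find_all("p")]

# First <p> with a given class (class_ avoids clashing with the Python keyword)
intro = soup.find("p", class_="intro").get_text()

# Attributes are read like dictionary keys
links = [a["href"] for a in soup.find_all("a", href=True)]

print(paragraphs)
print(intro)
print(links)
```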

### XML   
- hierarchical description of tagged data

```
<library>
<book>
<title>
For Whom the Bell Tolls
</title>
<author>
Ernest Hemingway
</author>
</book>
<book>
<title>
The Stranger
</title>
<author>
Albert Camus
</author>
</book>
</library>
```

## XML
- Let's take a look at an XML file
- We use the `xml.etree.ElementTree` module from the standard-library `xml` package to work with XML

In [13]:
import xml.etree.ElementTree as et

In [14]:
# parses the XML so that it figures out the "tree" of tags
tree = et.parse('data/data.xml')
tree

<xml.etree.ElementTree.ElementTree at 0x2816d4a90a0>

In [15]:
# get to the root tag of the file
root = tree.getroot()
root

<Element 'data' at 0x000002816D61F9F0>

In [None]:
root.tag

From the root, we can begin to navigate the tree

In [16]:
# get root tag
print("What is the root tag:", root.tag)

# get root attributes
print("Attributes of the root tag:", root.attrib)

# get number of "children"
print("Number of children:", len(root))

What is the root tag: data
Attributes of the root tag: {}
Number of children: 3


In [17]:
for i, child in enumerate(root):
    print(f"Child {i}: {child.tag}")
    for gc in child:
        print(f"\t\t{gc.tag}")

Child 0: country
		rank
		year
		gdppc
		neighbor
		neighbor
Child 1: country
		rank
		year
		gdppc
		neighbor
Child 2: country
		rank
		year
		gdppc
		neighbor
		neighbor


In [18]:
# printing out the children of root
for idx in range(len(root)):
    print("tag:", root[idx].tag, "| attribute:", root[idx].attrib)

tag: country | attribute: {'name': 'Liechtenstein'}
tag: country | attribute: {'name': 'Singapore'}
tag: country | attribute: {'name': 'Panama'}


In [54]:
x = root[0].attrib
x

{'name': 'Liechtenstein'}

In [55]:
x['name']

'Liechtenstein'

In [20]:
country3 = root[2]        # third <country> element
gdppc = country3[2]       # its third child, <gdppc>
neighbour = country3[3]   # its fourth child, the first <neighbor>

print(country3.attrib)
print(gdppc.tag, ":", gdppc.text)
print(neighbour.attrib)



{'name': 'Panama'}
gdppc : 13600
{'name': 'Costa Rica', 'direction': 'W'}
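
Indexing by position breaks as soon as the children change order. A sketch of the same navigation by tag name, using `findall`, `find` and `get` on an inline XML string modelled on `data/data.xml` (the exact contents of that file are an assumption here):

```python
import xml.etree.ElementTree as ET

# Inline stand-in for data/data.xml (assumed structure, two countries only)
xml_text = """
<data>
  <country name="Liechtenstein">
    <rank>1</rank>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Panama">
    <rank>68</rank>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>
"""

root = ET.fromstring(xml_text)

rows = []
for country in root.findall("country"):          # direct children named <country>
    rows.append({
        "name": country.get("name"),              # attribute lookup
        "gdppc": int(country.find("gdppc").text), # element text is always a string
        "neighbors": [n.get("name") for n in country.findall("neighbor")],
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame` if you want a table.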


In [22]:
!pip install xmltodict

Defaulting to user installation because normal site-packages is not writeable
Collecting xmltodict
  Downloading xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.13.0


In [23]:
import xmltodict, json

obj = xmltodict.parse("""
<employees>
    <employee>
        <name>Dave</name>
        <role>Sale Assistant</role>
        <age>34</age>
    </employee>
</employees>
""")
print(json.dumps(obj))

{"employees": {"employee": {"name": "Dave", "role": "Sale Assistant", "age": "34"}}}


## JSON

Here's some sample JSON: https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json

[JSON Plug-in](https://chrome.google.com/webstore/detail/json-formatter/bcjindcccaagfpapjjmafapmmgkkhgoa?hl=en)

In [24]:
# You can read from the URL too!!
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json'
json_data = pd.read_json(url)
json_data

Unnamed: 0,integer,datetime,category
0,5,2015-01-01 00:00:00,0
1,5,2015-01-01 00:00:01,0
2,9,2015-01-01 00:00:02,0
3,6,2015-01-01 00:00:03,0
4,6,2015-01-01 00:00:04,0
...,...,...,...
95,9,2015-01-01 00:01:35,0
96,8,2015-01-01 00:01:36,0
97,6,2015-01-01 00:01:37,0
98,8,2015-01-01 00:01:38,0


## JSON ... continued
- `pd.read_json` doesn't work well with nested JSON.
- Let's look at a nested JSON file (`data/nested.json`), in contrast to the flat file we just read: https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json

In [25]:
# this cell raises a ValueError: the nested lists in the file have different lengths
import pandas as pd 
from IPython.display import JSON
nested_json = pd.read_json('data/nested.json')
nested_json

ValueError: All arrays must be of the same length

Instead, you can try this:

In [26]:
import pprint
import json

with open('data/nested.json') as file:
    nested_json = json.load(file)

nested_json

{'article': [{'id': '01',
   'language': 'JSON',
   'edition': 'first',
   'author': 'Jane Doe'},
  {'id': '02', 'language': 'Python', 'edition': 'second', 'author': 'Mike'}],
 'blog': [{'name': 'LHL', 'URL': 'LighthouseLabs.com'}]}

In [27]:
pprint.pprint(nested_json)

{'article': [{'author': 'Jane Doe',
              'edition': 'first',
              'id': '01',
              'language': 'JSON'},
             {'author': 'Mike',
              'edition': 'second',
              'id': '02',
              'language': 'Python'}],
 'blog': [{'URL': 'LighthouseLabs.com', 'name': 'LHL'}]}


In [28]:
# Renders as an interactive tree in JupyterLab only
from IPython.display import JSON
JSON(nested_json)

<IPython.core.display.JSON object>

- What is the type?

In [29]:
type(nested_json)

dict
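
Because the result is an ordinary `dict` of lists of dicts, you can walk it with normal indexing. A small sketch on a trimmed copy of the same structure (the `raw` string is retyped here so the cell is self-contained):

```python
import json

# Trimmed copy of the nested structure shown above
raw = """
{"article": [{"id": "01", "author": "Jane Doe"},
             {"id": "02", "author": "Mike"}],
 "blog": [{"name": "LHL"}]}
"""
nested = json.loads(raw)

# key -> list position -> key
first_author = nested["article"][0]["author"]

# loop over one of the nested lists
authors = [article["author"] for article in nested["article"]]

# the top-level lists have different lengths, which is what trips up pd.read_json
lengths = {key: len(value) for key, value in nested.items()}

print(first_author, authors, lengths)
```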

## Nested JSON 

In [30]:
pd.json_normalize(nested_json)

Unnamed: 0,article,blog
0,"[{'id': '01', 'language': 'JSON', 'edition': '...","[{'name': 'LHL', 'URL': 'LighthouseLabs.com'}]"


In [31]:
blog = pd.json_normalize(nested_json, record_path='blog')
blog.head()

Unnamed: 0,name,URL
0,LHL,LighthouseLabs.com


In [32]:
article = pd.json_normalize(nested_json, record_path ='article')
article.head()

Unnamed: 0,id,language,edition,author
0,1,JSON,first,Jane Doe
1,2,Python,second,Mike


In [33]:
nested_json2 = nested_json['article']
nested_json2

[{'id': '01', 'language': 'JSON', 'edition': 'first', 'author': 'Jane Doe'},
 {'id': '02', 'language': 'Python', 'edition': 'second', 'author': 'Mike'}]

In [35]:
pd.json_normalize(nested_json2)

Unnamed: 0,id,language,edition,author
0,1,JSON,first,Jane Doe
1,2,Python,second,Mike


In [37]:
data = [{"state": "Florida", 
        "shortname": "FL",
        "info": {"governor": "Rick Scott", "governor_2": "John Doe"},
        "counties": [{"name": "Dade", "population": 12345},
                     {"name": "Broward", "population": 40000},
                     {"name": "Palm Beach", "population": 60000}]},
       {"state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [{"name": "Summit", "population": 1234},
                     {"name": "Cuyahoga", "population": 1337}]}]

In [38]:
pd.json_normalize(data)

Unnamed: 0,state,shortname,counties,info.governor,info.governor_2
0,Florida,FL,"[{'name': 'Dade', 'population': 12345}, {'name...",Rick Scott,John Doe
1,Ohio,OH,"[{'name': 'Summit', 'population': 1234}, {'nam...",John Kasich,


In [39]:
# meta has no effect without record_path, so the result matches the previous cell
pd.json_normalize(data=data, meta=['state', 'shortname'])

Unnamed: 0,state,shortname,counties,info.governor,info.governor_2
0,Florida,FL,"[{'name': 'Dade', 'population': 12345}, {'name...",Rick Scott,John Doe
1,Ohio,OH,"[{'name': 'Summit', 'population': 1234}, {'nam...",John Kasich,


In [40]:
pd.json_normalize(data=data, record_path='counties', meta=['state', 'shortname'])

Unnamed: 0,name,population,state,shortname
0,Dade,12345,Florida,FL
1,Broward,40000,Florida,FL
2,Palm Beach,60000,Florida,FL
3,Summit,1234,Ohio,OH
4,Cuyahoga,1337,Ohio,OH


In [41]:
pd.json_normalize(data=data, record_path=['counties'], meta=['state', 'shortname', ['info', 'governor']])

Unnamed: 0,name,population,state,shortname,info.governor
0,Dade,12345,Florida,FL,Rick Scott
1,Broward,40000,Florida,FL,Rick Scott
2,Palm Beach,60000,Florida,FL,Rick Scott
3,Summit,1234,Ohio,OH,John Kasich
4,Cuyahoga,1337,Ohio,OH,John Kasich
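
One thing to watch for: a `meta` key that is missing from some records (such as `governor_2`, which only Florida has) raises a `KeyError` by default. The sketch below, on a shortened copy of `data`, passes `errors="ignore"` so the missing values come back as `NaN` instead.

```python
import pandas as pd

# Shortened copy of the data above; only Florida has info.governor_2
data = [
    {"state": "Florida",
     "info": {"governor": "Rick Scott", "governor_2": "John Doe"},
     "counties": [{"name": "Dade", "population": 12345},
                  {"name": "Broward", "population": 40000}]},
    {"state": "Ohio",
     "info": {"governor": "John Kasich"},
     "counties": [{"name": "Summit", "population": 1234}]},
]

counties = pd.json_normalize(
    data,
    record_path="counties",
    meta=["state", ["info", "governor_2"]],
    errors="ignore",  # fill missing meta keys with NaN instead of raising KeyError
)

print(counties)
```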


## Text

In [42]:
with open('data/sample.txt', 'r') as f:
    emma = f.readlines()

In [43]:
emma

['"[Emma by Jane Austen 1816]  VOLUME I  CHAPTER I   Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.  She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister\'s marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.  Sixteen years had Miss Taylor been in Mr. Woodhouse\'s family, less as a governess than a friend, very fond of both daughters, but particularly of Emma. Between _them_ it was more the intimacy of sisters. Even before Miss Taylor had ceased to hold the nominal office of governess, the mildness

In [45]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 741.5 kB/s eta 0:00:00
Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
     -------------------------------------- 96.6/96.6 kB 785.0 kB/s eta 0:00:00
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
     -------------------------------------- 78.5/78.5 kB 874.7 kB/s eta 0:00:00
Collecting joblib
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
     ------------------------------------ 298.0/298.0 kB 836.4 kB/s eta 0:00:00
Collecting regex>=2021.8.3
  Downloading regex-2022.9.13-cp39-cp39-win_amd64.whl (267 kB)
     ------------------------------------ 267.7/267.7 kB 867.0 kB/s eta 0:00:00
Installing collected packages: tqdm, regex, joblib, click, nltk
Successfully installed click-8.1.3 joblib-1.2.0 nltk-3.7 regex-2022.9.13 tqdm-4.64.1


In [49]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [50]:
tokens = nltk.word_tokenize(emma[0])
tokens

['``',
 '[',
 'Emma',
 'by',
 'Jane',
 'Austen',
 '1816',
 ']',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a',
 'comfortable',
 'home',
 'and',
 'happy',
 'disposition',
 ',',
 'seemed',
 'to',
 'unite',
 'some',
 'of',
 'the',
 'best',
 'blessings',
 'of',
 'existence',
 ';',
 'and',
 'had',
 'lived',
 'nearly',
 'twenty-one',
 'years',
 'in',
 'the',
 'world',
 'with',
 'very',
 'little',
 'to',
 'distress',
 'or',
 'vex',
 'her',
 '.',
 'She',
 'was',
 'the',
 'youngest',
 'of',
 'the',
 'two',
 'daughters',
 'of',
 'a',
 'most',
 'affectionate',
 ',',
 'indulgent',
 'father',
 ';',
 'and',
 'had',
 ',',
 'in',
 'consequence',
 'of',
 'her',
 'sister',
 "'s",
 'marriage',
 ',',
 'been',
 'mistress',
 'of',
 'his',
 'house',
 'from',
 'a',
 'very',
 'early',
 'period',
 '.',
 'Her',
 'mother',
 'had',
 'died',
 'too',
 'long',
 'ago',
 'for',
 'her',
 'to',
 'have',
 'more',
 'than',
 'an',
 'i

In [51]:
set(tokens)

{"'s",
 ',',
 '--',
 '.',
 '1816',
 ';',
 'Austen',
 'Between',
 'CHAPTER',
 'Emma',
 'Even',
 'Her',
 'I',
 'It',
 'Jane',
 'Miss',
 'Mr.',
 'She',
 'Sixteen',
 'Sorrow',
 'Taylor',
 'The',
 'VOLUME',
 'Woodhouse',
 '[',
 ']',
 '_them_',
 '``',
 'a',
 'affection',
 'affectionate',
 'after',
 'ago',
 'all',
 'allowed',
 'alloy',
 'an',
 'and',
 'any',
 'as',
 'at',
 'attached',
 'authority',
 'away',
 'been',
 'before',
 'being',
 'beloved',
 'best',
 'blessings',
 'both',
 'bride-people',
 'brought',
 'but',
 'by',
 'came',
 'caresses',
 'ceased',
 'cheer',
 'chiefly',
 'clever',
 'comfortable',
 'composed',
 'consciousness.',
 'consequence',
 'continuance',
 'danger',
 'daughters',
 'did',
 'died',
 'dine',
 'dinner',
 'directed',
 'disadvantages',
 'disagreeable',
 'disposition',
 'distress',
 'doing',
 'early',
 'enjoyments',
 'esteeming',
 'evening',
 'event',
 'every',
 'evils',
 'excellent',
 'existence',
 'fallen',
 'family',
 'father',
 'first',
 'fond',
 'for',
 'friend',
 'f
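
`set(tokens)` gives the vocabulary but throws away how often each word occurs. As a rough sketch using only the standard library, the cell below counts word frequencies in the opening phrase of the text. Note that it swaps in a simple regular expression for `nltk.word_tokenize`, so punctuation is dropped rather than kept as separate tokens.

```python
import re
from collections import Counter

text = ("Emma Woodhouse, handsome, clever, and rich, with a comfortable "
        "home and happy disposition")

# Lowercase, then keep runs of letters and apostrophes (a crude tokenizer)
tokens = re.findall(r"[a-z']+", text.lower())

counts = Counter(tokens)          # word -> number of occurrences
vocabulary = sorted(set(tokens))  # unique words, alphabetical

print(counts.most_common(3))
print(len(tokens), len(vocabulary))
```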

In [53]:
!pip install tensorflow

^C
Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow
  Downloading tensorflow-2.10.0-cp39-cp39-win_amd64.whl (455.9 MB)
     ------------------------------------ 455.9/455.9 MB 779.8 kB/s eta 0:00:00
Collecting absl-py>=1.0.0
  Downloading absl_py-1.2.0-py3-none-any.whl (123 kB)
     ------------------------------------ 123.4/123.4 kB 725.1 kB/s eta 0:00:00
Collecting gast<=0.4.0,>=0.2.1
  Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting typing-extensions>=3.6.6
  Downloading typing_extensions-4.3.0-py3-none-any.whl (25 kB)
Collecting wrapt>=1.11.0
  Downloading wrapt-1.14.1-cp39-cp39-win_amd64.whl (35 kB)
Collecting libclang>=13.0.0
  Downloading libclang-14.0.6-py2.py3-none-win_amd64.whl (14.2 MB)
     -------------------------------------- 14.2/14.2 MB 843.1 kB/s eta 0:00:00
Collecting h5py>=2.9.0
  Downloading h5py-3.7.0-cp39-cp39-win_amd64.whl (2.6 MB)
     ---------------------------------------- 2.6/2.6 MB 856.7 kB/s e

In [52]:
from tensorflow.keras.layers import TextVectorization

max_tokens = 200
max_length = 10

text_vectorization = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

text_vectorization.adapt(emma)

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
text_vectorization.get_vocabulary()

In [None]:
raw_text_data = ([
    ["Emma was always there."],
])

text_vectorization(raw_text_data)
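
If TensorFlow is not installed, the idea behind `output_mode="int"` can still be sketched in plain Python: build a vocabulary from the most frequent words, map each word to its integer index, and pad or truncate to a fixed length. This is a simplified stand-in, not the Keras implementation; `build_vocab` and `vectorize` are hypothetical helper names, and unlike Keras this version only lowercases and does not strip punctuation. It does follow the Keras convention of reserving index 0 for padding and index 1 for out-of-vocabulary words.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # ids 0 and 1 are reserved, so only max_tokens - 2 words get their own id
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: i + 2 for i, word in enumerate(most_common)}


def vectorize(text, vocab, length):
    """Turn text into a fixed-length list of ids (1 = unknown, 0 = padding)."""
    ids = [vocab.get(word, 1) for word in text.lower().split()][:length]
    return ids + [0] * (length - len(ids))


vocab = build_vocab(["emma was clever", "emma was rich"], max_tokens=200)
encoded = vectorize("Emma was always there", vocab, length=6)

print(vocab)
print(encoded)
```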

## Images

In [None]:
from tensorflow.keras.datasets import cifar10

In [None]:
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

In [None]:
train_images.shape

In [None]:
import matplotlib.pyplot as plt

plt.imshow(train_images[40])
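
To a model, an image is just an array of numbers. CIFAR-10 images are 32 x 32 pixels with 3 colour channels, stored as unsigned 8-bit integers. The sketch below uses random pixels as a stand-in for `train_images` (so it runs without TensorFlow) and shows two common preparation steps: scaling to the 0-1 range and averaging the channels into a simple grayscale.

```python
import numpy as np

# Random stand-in for a batch of 4 CIFAR-10-sized images: (batch, height, width, channels)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Scale pixel values from 0-255 integers to 0-1 floats
scaled = images.astype("float32") / 255.0

# Average over the channel axis for a rough grayscale version
gray = scaled.mean(axis=-1)

print(images.shape, scaled.dtype, gray.shape)
```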

## Functions that help you tidy your data
- Review on your own after class


* `pd.pivot()` - Return reshaped DataFrame organized by given index / column values.
* `pd.pivot_table()` - Like `pivot`, but aggregates duplicate index / column combinations with `aggfunc` (mean by default).
* `pd.melt()` - Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
* `DataFrame.stack()` - Stack the prescribed level(s) from columns to index.
* `DataFrame.unstack()` - Pivot a level of the index labels into the columns, returning a DataFrame whose inner-most column level consists of the pivoted index labels.
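
As a quick illustration of `pd.melt`, the sketch below takes a wide table shaped like the `cases` half of Example 2 (the column names are shortened to plain years for this example) and unpivots it so that `year` becomes a proper variable.

```python
import pandas as pd

# Wide layout: one column per year (cases only)
wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "1999": [745, 37737, 212258],
    "2000": [2666, 80488, 213766],
})

# Unpivot the year columns into (year, cases) pairs
long = pd.melt(wide, id_vars="country", var_name="year", value_name="cases")
long["year"] = long["year"].astype(int)  # column labels were strings
long = long.sort_values(["country", "year"]).reset_index(drop=True)

print(long)
```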

In [None]:
import pandas as pd
import numpy as np

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                   ['one', 'two','one', 'two','one', 'two','one', 'two']]))
tuples

In [None]:
index = pd.MultiIndex.from_tuples(tuples, names = ['first', 'second'])
index

In [None]:
import numpy as np
stacked = pd.DataFrame(np.random.randn(8,2), index = index, columns=['A', 'B'])
stacked

In [None]:
stacked.unstack()

In [None]:
# Level(s) of index to unstack, can pass level name.
stacked.unstack('first')

In [None]:
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

In [None]:
df

In [None]:
# the string 'sum' is equivalent here; newer pandas versions warn when np.sum is passed
pd.pivot_table(df, values='D', index=['A', 'B'],
               columns=['C'], aggfunc='sum')