#### The data analysis process is composed of the following steps:

* State the problem
* Obtain your data
* Clean the data
* Normalize the data
* Transform the data
* Exploratory statistics
* Exploratory visualization
* Predictive modeling
* Validate your model
* Visualize and interpret your results
* Deploy your solution

Data analysis process grouped:

<img src="dataanalysis1.png">

#### Open data

Open data is data that can be used, reused, and redistributed freely by anyone for any purpose. The following is a short list of repositories and databases for open data:

* Datahub is available at http://datahub.io/
* Book-Crossing Dataset is available at http://www.informatik.uni-freiburg.de/~cziegler/BX/
* World Health Organization is available at http://www.who.int/research/en/
* The World Bank is available at http://data.worldbank.org/
* NASA is available at http://data.nasa.gov/
* United States Government is available at http://www.data.gov/
* Machine Learning Datasets are available at http://bitly.com/bundles/bigmlcom/2
* Scientific Data from University of Muenster is available at http://data.uni-muenster.de/
* Hilary Mason's research-quality datasets are available at https://bitly.com/bundles/hmason/1

# Chapter 2 - Working with Data

# Text parsing

In [7]:
import re

In the following examples, we will implement three of the most common validations
(e-mail, IP address, and date format).

E-mail validation:

In [9]:
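myString = 'From: readers@packt.com (readers email)'
# Use a raw string so that \w reaches the regex engine unchanged
result = re.search(r'([\w.-]+)@([\w.-]+)', myString)
if result:
    print(result.group(0))  # the whole match
    print(result.group(1))  # the part before the @
    print(result.group(2))  # the part after the @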

readers@packt.com
readers
packt.com


The function search() scans through a string, looking for the first location
where the regex matches. The method group() returns the string matched by the
regex: group(0) is the whole match, and group(1) and group(2) are the
parenthesized subgroups. The pattern \w matches any alphanumeric character or
underscore; for ASCII text it is equivalent to the class [a-zA-Z0-9_].
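
The example above extracts an address from a longer string. As a minimal sketch of validating a whole string instead, `re.fullmatch()` requires the entire input to match. The helper name `split_email` is hypothetical, and the pattern is the simple one used above, not a complete e-mail grammar.

```python
import re

# Same simple pattern as above, with named groups for readability.
EMAIL = re.compile(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)')

def split_email(text):
    """Return (user, host) if the whole string looks like an e-mail, else None."""
    result = EMAIL.fullmatch(text)
    if result is None:
        return None
    return result.group('user'), result.group('host')

print(split_email('readers@packt.com'))        # a bare address matches
print(split_email('From: readers@packt.com'))  # extra text makes fullmatch fail
```

Here `search()` would still find the address inside the second string, while `fullmatch()` rejects it because of the surrounding text.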

IP address validation:

In [10]:
# Raw string: the backslashes belong to the regex, not to Python
isIP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
myString = " Your IP is: 192.168.1.254 "
result = isIP.findall(myString)
print(result)

['192.168.1.254']


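The function findall() finds all the substrings where the regex matches
and returns them as a list. The pattern \d matches any decimal digit; for
ASCII text it is equivalent to the class [0-9]. Note that \d{1,3} also accepts
out-of-range values such as 999, so this pattern checks the shape of an
address, not its validity.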
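
One way to go from matching the shape to validating the value is to pass each candidate to the standard library's `ipaddress` module. The following is a minimal sketch; the helper name `is_ipv4` and the sample text are assumptions made for illustration.

```python
import ipaddress
import re

def is_ipv4(text):
    """Return True if text is a valid dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False

# Hypothetical input: one valid address and one with an out-of-range octet.
text = "hosts: 192.168.1.254 and 999.1.1.1"
candidates = re.findall(r'\d{1,3}(?:\.\d{1,3}){3}', text)
valid = [ip for ip in candidates if is_ipv4(ip)]
print(valid)
```

The regex finds both candidates, and only the first one survives the `ipaddress` check.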

Date format:

In [30]:
myString = "01/04/2001"
isDate = re.match(r'[0-3][0-9]/[0-3][0-9]/[1-2][0-9]{3}', myString)
if isDate:
    print("valid")
else:
    print("invalid")

valid


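The function match() checks whether the regex matches at the beginning of the
string. The pattern uses character classes such as [0-3] and [0-9] to check the
shape of the date; it does not verify that the day and month are in range (for
example, 39/39/2001 also matches).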
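
To check that a date is real as well as well-formed, one option is `datetime.strptime()`, which raises `ValueError` for impossible dates. This is a minimal sketch: the helper name `is_valid_date` is hypothetical, and the day/month/year order is an assumption, since the example date does not reveal which order is intended.

```python
from datetime import datetime

def is_valid_date(text, fmt="%d/%m/%Y"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

print(is_valid_date("01/04/2001"))  # a real date
print(is_valid_date("39/39/2001"))  # right shape, impossible date
print(is_valid_date("31/02/2001"))  # February has no day 31
```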

# Data transformation

Extract, Transform, and Load (ETL) obtains data from data sources, applies transformations that depend on our data model, and loads the resulting data into a destination.
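
The three ETL stages can be sketched with the standard library alone. Everything here is an assumption made for illustration: the in-memory `SOURCE` text stands in for a real data source, the function names are hypothetical, and an in-memory SQLite database plays the role of the destination.

```python
import csv
import io
import sqlite3

# Hypothetical source data, shaped like the pokemon.csv records used below.
SOURCE = "id, name, type\n001, Bulbasaur, Grass\n006, Charizard, Fire\n"

def extract(text):
    """Extract: read the raw CSV rows, skipping the header."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return list(reader)

def transform(rows):
    """Transform: convert ids to integers and strip stray spaces."""
    return [(int(i), name.strip(), kind.strip()) for i, name, kind in rows]

def load(rows, connection):
    """Load: insert the cleaned rows into the destination table."""
    connection.execute("CREATE TABLE pokemon (id INTEGER, name TEXT, type TEXT)")
    connection.executemany("INSERT INTO pokemon VALUES (?, ?, ?)", rows)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), connection)
stored = connection.execute(
    "SELECT id, name, type FROM pokemon ORDER BY id").fetchall()
print(stored)
```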

## Parsing a CSV file with the csv module

In [6]:
import csv

 The first eight records of the CSV file ( pokemon.csv ) look as follows:
 
 ```
 id, typeTwo, name, type
 001, Poison, Bulbasaur, Grass
 002, Poison, Ivysaur, Grass
 003, Poison, Venusaur, Grass
 006, Flying, Charizard, Fire
 012, Flying, Butterfree, Bug
 013, Poison, Weedle, Bug
 014, Poison, Kakuna, Bug
 015, Poison, Beedrill, Bug
 ```

In [9]:
with open("pokemon.csv", newline="") as f: # CSV file with a header and eight records
    data = csv.reader(f)
    next(data) # skip the header row (advance the reader, not the file)
    # Now we just iterate over the reader
    for line in data:
        print("id: {0}, typeTwo: {1}, name: {2}, type: {3}"
              .format(line[0],line[1],line[2],line[3]))

id:  001, typeTwo:  Poison, name:  Bulbasaur, type:  Grass
id:  002, typeTwo:  Poison, name:  Ivysaur, type:  Grass
id:  003, typeTwo:  Poison, name:  Venusaur, type:  Grass
id:  006, typeTwo:  Flying, name:  Charizard, type:  Fire
id:  012, typeTwo:  Flying, name:  Butterfree, type:  Bug
id:  013, typeTwo:  Poison, name:  Weedle, type:  Bug
id:  014, typeTwo:  Poison, name:  Kakuna, type:  Bug
id:  015, typeTwo:  Poison, name:  Beedrill, type:  Bug
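
The doubled spaces in this output come from the space that follows each comma in the file. As a minimal sketch, `csv.DictReader` with `skipinitialspace=True` drops those spaces and lets us refer to columns by name. The `SAMPLE` text is an in-memory stand-in for the file, used so the example runs on its own.

```python
import csv
import io

# In-memory stand-in for pokemon.csv (first rows only).
SAMPLE = (
    "id, typeTwo, name, type\n"
    "001, Poison, Bulbasaur, Grass\n"
    "006, Flying, Charizard, Fire\n"
)

# DictReader takes the column names from the header row;
# skipinitialspace=True ignores the space after each comma.
reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
rows = list(reader)
for row in rows:
    print("id: {id}, typeTwo: {typeTwo}, name: {name}, type: {type}".format(**row))
```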


## Parsing a CSV file using NumPy

In [12]:
import numpy as np

NumPy provides the ```genfromtxt``` function. Here we pass it four arguments: first, the name of the file (```pokemon.csv```); then ```skip_header=1``` to skip the header line; next the data type (```dtype=None```, which lets NumPy infer a type for each column); and finally the comma as the ```delimiter```.

In [14]:
data = np.genfromtxt("pokemon.csv", skip_header=1, dtype=None, delimiter=',')

In [15]:
print(data)

[( 1, b' Poison', b' Bulbasaur', b' Grass')
 ( 2, b' Poison', b' Ivysaur', b' Grass')
 ( 3, b' Poison', b' Venusaur', b' Grass')
 ( 6, b' Flying', b' Charizard', b' Fire')
 (12, b' Flying', b' Butterfree', b' Bug')
 (13, b' Poison', b' Weedle', b' Bug')
 (14, b' Poison', b' Kakuna', b' Bug')
 (15, b' Poison', b' Beedrill', b' Bug')]
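
The byte strings and leading spaces in this output can be avoided with a few more `genfromtxt` arguments. The following is a sketch under stated assumptions: `SAMPLE` is an in-memory stand-in for the file, `names=True` takes the field names from the header instead of skipping it, `autostrip=True` removes the surrounding spaces, and `encoding="utf-8"` asks for text strings rather than bytes.

```python
import io
import numpy as np

# In-memory stand-in for pokemon.csv (first rows only).
SAMPLE = (
    "id, typeTwo, name, type\n"
    "001, Poison, Bulbasaur, Grass\n"
    "006, Flying, Charizard, Fire\n"
)

data = np.genfromtxt(
    io.StringIO(SAMPLE),
    delimiter=",",
    names=True,        # read field names from the header line
    dtype=None,        # infer a type for each column
    encoding="utf-8",  # decode strings as text instead of bytes
    autostrip=True,    # strip whitespace around each value
)

# The result is a structured array, so columns are accessed by name.
print(data["name"])
print(data["id"])
```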


## Parsing a JSON file using the json module

In [16]:
import json
from pprint import pprint

In [20]:
with open("pokemon.json") as f:
    data = json.load(f)  # json.load reads directly from the file object

In [21]:
pprint(data)

[{'id': '001', 'name': 'Bulbasaur', 'type': 'Grass', 'typeTwo': 'Poison'},
 {'id': '002', 'name': 'Ivysaur', 'type': 'Grass', 'typeTwo': 'Poison'},
 {'id': '003', 'name': 'Venusaur', 'type': 'Grass', 'typeTwo': 'Poison'},
 {'id': '006', 'name': 'Charizard', 'type': 'Fire', 'typeTwo': 'Flying'},
 {'id': '012', 'name': 'Butterfree', 'type': 'Bug', 'typeTwo': 'Flying'},
 {'id': '013', 'name': 'Weedle', 'type': 'Bug', 'typeTwo': 'Poison'},
 {'id': '014', 'name': 'Kakuna', 'type': 'Bug', 'typeTwo': 'Poison'},
 {'id': '015', 'name': 'Beedrill', 'type': 'Bug', 'typeTwo': 'Poison'}]
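
The `json` module offers two parsing entry points: `json.loads()` takes a string, and `json.load()` takes a file-like object. A minimal sketch of both, using a hypothetical in-memory `TEXT` in place of the file:

```python
import io
import json

# Hypothetical JSON text shaped like the records above.
TEXT = ('[{"id": "001", "name": "Bulbasaur", "type": "Grass"},'
        ' {"id": "006", "name": "Charizard", "type": "Fire"}]')

records_from_string = json.loads(TEXT)            # parse from a str
records_from_file = json.load(io.StringIO(TEXT))  # parse from a file-like object

# Once parsed, the data is ordinary lists and dictionaries.
grass_names = [r["name"] for r in records_from_string if r["type"] == "Grass"]
print(grass_names)
```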


# XML

Parsing an XML file in Python using the xml.etree.ElementTree module

In [23]:
from xml.etree import ElementTree

In [24]:
with open("pokemom.xml") as f:
    doc = ElementTree.parse(f)

In [26]:
for node in doc.findall('row'):
    print("")
    print("id:      {0}".format(node.find('id').text))
    print("typeTwo: {0}".format(node.find('typeTwo').text))
    print("name:    {0}".format(node.find('name').text))
    print("type:    {0}".format(node.find('type').text))


id:       001
typeTwo:  Poison
name:     Bulbasaur
type:     Grass

id:       002
typeTwo:  Poison
name:     Ivysaur
type:     Grass

id:       003
typeTwo:  Poison
name:     Venusaur
type:     Grass

id:       006
typeTwo:  Flying
name:     Charizard
type:     Fire

id:       012
typeTwo:  Flying
name:     Butterfree
type:     Bug

id:       013
typeTwo:  Poison
name:     Weedle
type:     Bug

id:       014
typeTwo:  Poison
name:     Kakuna
type:     Bug

id:       015
typeTwo:  Poison
name:     Beedrill
type:     Bug
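
The same traversal can be tried without a file by parsing a string with `ElementTree.fromstring()`. This is a sketch under assumptions: the `SAMPLE` document and its root element name `data` are invented for illustration; only the `row` elements and their four children are taken from the code above.

```python
from xml.etree import ElementTree

# Hypothetical XML with the row layout used above; the root name is assumed.
SAMPLE = """<data>
  <row><id>001</id><typeTwo>Poison</typeTwo><name>Bulbasaur</name><type>Grass</type></row>
  <row><id>006</id><typeTwo>Flying</typeTwo><name>Charizard</name><type>Fire</type></row>
</data>"""

root = ElementTree.fromstring(SAMPLE)

# Turn each <row> into a dictionary mapping child tag to child text.
rows = [{child.tag: child.text for child in node} for node in root.findall("row")]
print(rows)

# findtext() returns a default instead of failing when a child is missing.
print(root.find("row").findtext("weight", default="unknown"))
```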


# Chapter 3 - Data Visualization

In the basic examples, we can just open our HTML document in a web browser to view it. But when we need to load external data sources, we need to serve the folder from a web server such as Apache, nginx, or IIS. Python provides an easy way to run a web server with **http.server**; we just need to open the folder where our D3 files are located and execute the following command in the terminal:

In [2]:
#python3 -m http.server 8000

### Bar chart

In [1]:
# We need to import the necessary modules.
import json
import csv
from pprint import pprint

In [2]:
#Now, we define a dictionary to store the result
typePokemon = {}
#Open and load the JSON file.
with open("pokemon.json") as f:
    data = json.loads(f.read())
    
    #Fill the typePokemon dictionary with the number of pokemon of each type
    for line in data:
        if line["type"] not in typePokemon:
            typePokemon[line["type"]] = 1
        else:
            typePokemon[line["type"]] += 1
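
The counting loop above can also be written with `collections.Counter` from the standard library. A minimal sketch, using a small hypothetical list in place of the parsed JSON file:

```python
from collections import Counter

# Hypothetical records standing in for the contents of pokemon.json.
data = [{"type": "Grass"}, {"type": "Fire"}, {"type": "Grass"}]

# Counter tallies each type in one pass.
type_counts = Counter(line["type"] for line in data)

# most_common() returns (type, count) pairs, largest count first.
print(type_counts.most_common())
```

A `Counter` is a `dict` subclass, so it can be passed to the code that follows in place of a plain dictionary.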

In [12]:
#Open the sumPokemon.csv file in write mode
with open("sumPokemon.csv", "w", newline="") as a:
    w = csv.writer(a)

    #Sort the (type, count) pairs by count (x[1]) and
    #write each pair (type and amount) into the csv file
    for key, value in sorted(typePokemon.items(), key=lambda x: x[1]):
        w.writerow([key,str(value)])

In [13]:
#finally, we use "pretty print" to print the dictionary
pprint(typePokemon)

{' Bug': 45,
 ' Dark': 16,
 ' Dragon': 12,
 ' Electric': 7,
 ' Fighting': 3,
 ' Fire': 14,
 ' Ghost': 10,
 ' Grass': 31,
 ' Ground': 17,
 ' Ice': 11,
 ' Normal': 29,
 ' Poison': 11,
 ' Psychic': 9,
 ' Rock': 24,
 ' Steel': 13,
 ' Water': 45}


In [14]:
typePokemon.items()

dict_items([(' Poison', 11), (' Psychic', 9), (' Fire', 14), (' Bug', 45), (' Electric', 7), (' Dark', 16), (' Normal', 29), (' Steel', 13), (' Grass', 31), (' Water', 45), (' Rock', 24), (' Fighting', 3), (' Dragon', 12), (' Ground', 17), (' Ice', 11), (' Ghost', 10)])

In [15]:
typePokemon.keys()

dict_keys([' Poison', ' Psychic', ' Fire', ' Bug', ' Electric', ' Dark', ' Normal', ' Steel', ' Grass', ' Water', ' Rock', ' Fighting', ' Dragon', ' Ground', ' Ice', ' Ghost'])

In [16]:
typePokemon.values()

dict_values([11, 9, 14, 45, 7, 16, 29, 13, 31, 45, 24, 3, 12, 17, 11, 10])

In [19]:
#Open the sumPokemon.csv file in write mode
with open("sumPokemon.csv", "w", newline="") as a:
    w = csv.writer(a)

    #Sort the dictionary by number of pokemon.
    #sorted() needs the (type, count) pairs from items(); values() yields
    #only the counts, so x[1] would raise
    #"TypeError: 'int' object is not subscriptable"
    for key, value in sorted(typePokemon.items(), key=lambda x: x[1]):
        w.writerow([key,str(value)])

#finally, we use "pretty print" to print the dictionary
pprint(typePokemon)
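
Sorting a dictionary for output needs its (key, value) pairs: `values()` yields only the counts, while `items()` yields pairs that can be ordered by either element. The following is a minimal sketch; `counts` is a hypothetical three-entry dictionary, and an in-memory buffer replaces the CSV file so the example runs on its own.

```python
import csv
import io

# Hypothetical counts, a small stand-in for typePokemon.
counts = {"Bug": 45, "Fire": 14, "Grass": 31}

by_count = sorted(counts.items(), key=lambda pair: pair[1])  # smallest count first
by_name = sorted(counts.items())                             # alphabetical by type

# Write the count-ordered pairs as CSV rows into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(by_count)
print(buffer.getvalue())
```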