#### The data analysis process is composed of the following steps:

* State the problem
* Obtain your data
* Clean the data
* Normalize the data
* Transform the data
* Exploratory statistics
* Exploratory visualization
* Predictive modeling
* Validate your model
* Visualize and interpret your results
* Deploy your solution

Data analysis process grouped:

<img src="dataanalysis1.png">

#### Open data

Open data is data that can be used, reused, and redistributed freely by anyone for any purpose. The following is a short list of repositories and databases for open data:

* Datahub is available at http://datahub.io/
* Book-Crossing Dataset is available at http://www.informatik.uni-freiburg.de/~cziegler/BX/
* World Health Organization is available at http://www.who.int/research/en/
* The World Bank is available at http://data.worldbank.org/
* NASA is available at http://data.nasa.gov/
* United States Government is available at http://www.data.gov/
* Machine Learning Datasets is available at http://bitly.com/bundles/bigmlcom/2
* Scientific Data from University of Muenster is available at http://data.uni-muenster.de/
* Hilary Mason research-quality datasets is available at https://bitly.com/bundles/hmason/1

# Text parsing

In [7]:
import re

In the following examples, we will implement three of the most common validations
(e-mail, IP address, and date format).

E-mail validation:

In [9]:
myString = 'From: readers@packt.com (readers email)'
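# Raw string (r'...') keeps backslashes such as \w from being treated as string escapes
result = re.search(r'([\w.-]+)@([\w.-]+)', myString)
if result:
    print(result.group(0))  # the whole match
    print(result.group(1))  # first group: the part before the @
    print(result.group(2))  # second group: the part after the @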

readers@packt.com
readers
packt.com


The function search() scans through a string and returns the first location
where the regex matches. The method group() returns the string matched by
the regex: group(0) is the whole match, and group(1) and group(2) are the
parenthesized subgroups. The pattern \w matches any alphanumeric character
or underscore; for ASCII text it is equivalent to the class [a-zA-Z0-9_].
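
Note that search() only locates an address inside a longer string. To check that a whole string is an address, a stricter test is needed. The following is a minimal sketch, assuming a deliberately simple pattern; the names `EMAIL_RE` and `is_valid_email` are hypothetical, and the pattern does not cover every address that the e-mail standards allow.

```python
import re

# Hypothetical, deliberately simple pattern: a local part, one "@",
# and a domain that contains at least one dot.
EMAIL_RE = re.compile(r"[\w.-]+@[\w-]+(\.[\w-]+)+")


def is_valid_email(candidate):
    """Return True only if the whole string looks like an e-mail address."""
    # fullmatch() requires the pattern to cover the entire string,
    # unlike search(), which accepts a match anywhere inside it.
    return EMAIL_RE.fullmatch(candidate) is not None


print(is_valid_email("readers@packt.com"))
print(is_valid_email("From: readers@packt.com (readers email)"))
```

The first call accepts the bare address, while the second rejects the full header line because of the surrounding text.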

IP address validation:

In [10]:
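# Raw string so that \d and \. reach the regex engine unchanged
isIP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
myString = " Your IP is: 192.168.1.254 "
result = isIP.findall(myString)  # call findall() on the compiled pattern
print(result)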

['192.168.1.254']


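The function findall() finds all the substrings where the regex matches
and returns them as a list. The pattern \d matches any decimal digit; for
ASCII text it is equivalent to the class [0-9]. Note that this pattern only
checks the shape of the address, so it also accepts out-of-range values
such as 999.999.999.999.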
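
Because the regex checks only the shape of the address, the candidates it finds can be passed through a second check. The following is a minimal sketch, assuming the standard library `ipaddress` module is acceptable for that second step; the names `IP_CANDIDATE` and `find_valid_ipv4` are hypothetical.

```python
import ipaddress
import re

# Same shape-only pattern as in the cell above
IP_CANDIDATE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")


def find_valid_ipv4(text):
    """Return the dotted-quad substrings of text that are real IPv4 addresses."""
    valid = []
    for candidate in IP_CANDIDATE.findall(text):
        try:
            # IPv4Address raises a ValueError subclass for octets above 255
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        valid.append(candidate)
    return valid


print(find_valid_ipv4("Your IP is: 192.168.1.254, not 999.168.1.254"))
```

Here the regex finds two candidates, and the range check keeps only the first one.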

Date format:

In [30]:
myString = "01/04/2001"
# Raw string; the forward slash needs no escaping in a Python regex
isDate = re.match(r'[0-3][0-9]/[0-3][0-9]/[1-2][0-9]{3}', myString)
if isDate:
    print("valid")
else:
    print("invalid")

valid


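The function match() checks whether the regex matches at the beginning of
the string; it does not require the match to reach the end of the string.
The pattern uses digit classes such as [0-3] and [0-9] to constrain each
position of the date, so it checks the format only and still accepts
impossible dates such as 39/39/2999.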
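
To reject impossible dates as well, the string can be handed to a real date parser. The following is a minimal sketch, assuming the intended format is day/month/year (the original example does not say which of the two orders is meant); the function name `is_valid_date` and its `fmt` parameter are hypothetical.

```python
from datetime import datetime


def is_valid_date(candidate, fmt="%d/%m/%Y"):
    """Return True if candidate parses as a real calendar date in the given format."""
    try:
        # strptime() raises ValueError for bad formats, impossible dates,
        # and trailing text after the date
        datetime.strptime(candidate, fmt)
    except ValueError:
        return False
    return True


print(is_valid_date("01/04/2001"))
print(is_valid_date("39/39/2999"))
```

The first date parses, while the second is rejected even though it satisfies the regex.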

# Data transformation

Extract, Transform, and Load (ETL) obtains data from data sources, applies transformations that depend on our data model, and loads the resulting data into a destination.
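
The three ETL stages can be sketched as three small functions. The following is a minimal sketch, assuming CSV text as the source and an in-memory SQLite database as the destination; the sample data, the table name `pokemon`, and the functions `extract`, `transform`, and `load` are all hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this could be a file or a web service
SOURCE = "id,name,type\n001,Bulbasaur,Grass\n006,Charizard,Fire\n"


def extract(text):
    """Extract: read rows from CSV text as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert ids to integers and upper-case the type."""
    return [(int(row["id"]), row["name"], row["type"].upper()) for row in rows]


def load(records, connection):
    """Load: write the transformed records into a destination table."""
    connection.execute("CREATE TABLE pokemon (id INTEGER, name TEXT, type TEXT)")
    connection.executemany("INSERT INTO pokemon VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), connection)
rows = connection.execute("SELECT id, name, type FROM pokemon ORDER BY id").fetchall()
print(rows)
```

Keeping each stage in its own function makes it possible to swap the source or the destination without touching the transformation logic.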

Parsing a CSV file with the csv module

In [31]:
import csv

In [34]:
with open("pokemon.csv", newline="") as f:  # small CSV file: a header row plus 8 data rows
    data = csv.reader(f)
    # Iterate over the reader; each line is a list of column values
    for line in data:
        print("id: {0}, typeTwo: {1}, name: {2}, type: {3}".format(line[0], line[1], line[2], line[3]))

id:  id, typeTwo:  typeTwo, name:  name, type:  type
id:  001, typeTwo:  Poison, name:  Bulbasaur, type:  Grass
id:  002, typeTwo:  Poison, name:  Ivysaur, type:  Grass
id:  003, typeTwo:  Poison, name:  Venusaur, type:  Grass
id:  006, typeTwo:  Flying, name:  Charizard, type:  Fire
id:  012, typeTwo:  Flying, name:  Butterfree, type:  Bug
id:  013, typeTwo:  Poison, name:  Weedle, type:  Bug
id:  014, typeTwo:  Poison, name:  Kakuna, type:  Bug
id:  015, typeTwo:  Poison, name:  Beedrill, type:  Bug
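
The output above has two quirks: the header row is printed as if it were data, and each value appears to carry a leading space. The following is a minimal sketch of one way to handle both, assuming the fields in the file are separated by a comma followed by a space (an assumption based on the doubled spaces in the output, not on the file itself). The `SAMPLE` text is hypothetical and stands in for `pokemon.csv`.

```python
import csv
import io

# Hypothetical sample that mirrors the columns printed above
SAMPLE = (
    "id, typeTwo, name, type\n"
    "001, Poison, Bulbasaur, Grass\n"
    "006, Flying, Charizard, Fire\n"
)

# DictReader consumes the header row and uses it as the dictionary keys;
# skipinitialspace=True drops the space that follows each comma.
reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
rows = list(reader)

for row in rows:
    print("id: {id}, name: {name}, type: {type}/{typeTwo}".format(**row))
```

Accessing columns by name rather than by position also keeps the code working if the column order in the file changes.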
