There are a few standard ways of handling data. Data usually arrives in CSV format, which can be read or written with the basic Python instructions given below.

![image.png](attachment:image.png)
![image.png](attachment:image.png)
![image.png](attachment:image.png)
![image.png](attachment:image.png)
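
As a minimal sketch of reading and writing CSV with pandas: the sample text and the names `csv_text`, `frame`, and `round_tripped` are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical sample data; an in-memory buffer stands in for a file on disk.
csv_text = "name,age\nScott,30\nKatie,38\n"

# Read CSV text into a DataFrame (pass a file path instead of a buffer for a real file).
frame = pd.read_csv(io.StringIO(csv_text))

# Write the DataFrame back out as CSV, leaving out the row index.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
round_tripped = buffer.getvalue()
```

Passing `index=False` keeps the row labels out of the file, which avoids an extra unnamed column when the file is read back.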

Let's see an example of how to handle JSON (JavaScript Object Notation) data. JSON has become one of the standard formats for sending data by HTTP request between web browsers and other applications.

In [1]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

In [2]:
import json
result = json.loads(obj)  # parses the JSON string into Python objects

In [3]:
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [4]:
as_json = json.dumps(result) # converts a python object back to JSON

In [5]:
import pandas as pd
siblings = pd.DataFrame(result['siblings'], columns= ['name','age'])

In [6]:
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


In [7]:
# You can use pd.read_json to load a JSON file straight into a DataFrame
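
A small sketch of `pd.read_json`, under the assumption that the JSON holds a list of flat records. The records and variable names are hypothetical, and a `StringIO` buffer stands in for a `.json` file.

```python
import io
import json
import pandas as pd

# Hypothetical records; in practice this text would live in a .json file.
records = [{"name": "Scott", "age": 30}, {"name": "Katie", "age": 38}]
json_text = json.dumps(records)

# read_json accepts a path or a file-like object; StringIO stands in for a file here.
siblings_from_json = pd.read_json(io.StringIO(json_text))

# to_json goes the other way; orient="records" yields a list of row objects.
back_to_json = siblings_from_json.to_json(orient="records")
```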

### XML and HTML Web Scraping 

Python has many libraries for reading and writing data in the ubiquitous HTML and
XML formats. Examples include lxml, Beautiful Soup, and html5lib. While lxml is
comparatively much faster in general, the other libraries can better handle malformed
HTML or XML files.

pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful
Soup to automatically parse tables out of HTML files as DataFrame objects. To
show how this works, I downloaded an HTML file (used in the pandas documentation)
from the United States FDIC government agency showing bank failures. First,
you must install some additional libraries used by read_html:

`conda install lxml`

`pip install beautifulsoup4 html5lib`

### Web APIs

In [8]:
import requests

In [9]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [10]:
resp = requests.get(url)

In [11]:
resp

<Response [200]>

The Response object's json method returns the JSON payload parsed into native Python objects (here, a list of dictionaries, one per issue).

In [12]:
data = resp.json()

In [13]:
data[0]['title']

'Option to return all idxmax rows in case of ties'

In [14]:
issues = pd.DataFrame(data, columns=['number', 'title','labels', 'state'])

In [15]:
issues

Unnamed: 0,number,title,labels,state
0,34205,Option to return all idxmax rows in case of ties,[],open
1,34204,BUG: stacked histogram on datetimes,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
2,34203,"REF: move get_freq_group to libfreqs, de-dupli...","[{'id': 127681, 'node_id': 'MDU6TGFiZWwxMjc2OD...",open
3,34202,REF: de-duplicate methods calling get_dst_info,[],open
4,34201,ENH: Implement __iter__ for Rolling and Expanding,"[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...",open
5,34200,CLN/TYP: Groupby agg methods,[],open
6,34199,PERF: Fixes performance regression in DataFram...,[],open
7,34198,CLN: let Index use general concat machinery - ...,"[{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...",open
8,34197,ENH: implement efficient sorting methods for S...,"[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...",open
9,34196,REF: remove redundant get_freq function,"[{'id': 127681, 'node_id': 'MDU6TGFiZWwxMjc2OD...",open
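
The labels column above holds a list of dictionaries per issue, which is awkward to read. The following sketch pulls out just the label names. It runs offline on a hypothetical, trimmed-down payload (`sample_issues`) instead of calling the API, and assumes each label dictionary carries a `name` key.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-in for resp.json(); no network call is made.
sample_issues = [
    {"number": 1, "title": "First issue", "labels": [], "state": "open"},
    {"number": 2, "title": "Second issue",
     "labels": [{"id": 10, "name": "Bug"}, {"id": 11, "name": "Docs"}],
     "state": "open"},
]

issues_sample = pd.DataFrame(sample_issues,
                             columns=["number", "title", "labels", "state"])

# Each cell in 'labels' is a list of dicts; keep only the label names.
issues_sample["label_names"] = issues_sample["labels"].apply(
    lambda labels: [label["name"] for label in labels]
)
```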


### Interacting with Databases
In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database is
usually dependent on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I’ll create a SQLite database using Python’s built-in sqlite3 driver

In [16]:
import sqlite3

In [17]:
query = """CREATE TABLE IF NOT EXISTS test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER);"""

In [18]:
con = sqlite3.connect('mydata.sqlite')
con.execute(query)  # create the table if it does not exist yet
con.commit()

In [19]:
data = [('Atlanta', 'Georgia', 1.25, 6),('Tallahassee', 'Florida', 2.6, 3), ('Sacramento', 'California', 1.7, 5)]

In [20]:
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

In [21]:
con.executemany(stmt, data)

<sqlite3.Cursor at 0x12071a49960>

In [22]:
con.commit()

In [23]:
cursor = con.execute('select * from test')

In [24]:
rows = cursor.fetchall()

In [25]:
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

In [26]:
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [27]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
3,Atlanta,Georgia,1.25,6
4,Tallahassee,Florida,2.6,3
5,Sacramento,California,1.7,5
6,Atlanta,Georgia,1.25,6
7,Tallahassee,Florida,2.6,3
8,Sacramento,California,1.7,5


This is quite a bit of munging that you’d rather not repeat each time you query the database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that enables you to read data easily from a general SQLAlchemy
connection. Here, we’ll connect to the same SQLite database with SQLAlchemy and read data from the table created before

In [28]:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from test', db)

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
3,Atlanta,Georgia,1.25,6
4,Tallahassee,Florida,2.6,3
5,Sacramento,California,1.7,5
6,Atlanta,Georgia,1.25,6
7,Tallahassee,Florida,2.6,3
8,Sacramento,California,1.7,5
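
For SQLite specifically, pandas can also read through a plain `sqlite3` connection. The sketch below uses an in-memory database so it leaves no file behind. The table contents, the names `con_mem` and `big_d`, and the filter on `d` are illustrative assumptions rather than part of the example above.

```python
import sqlite3
import pandas as pd

# An in-memory database keeps the sketch self-contained (no file is created).
con_mem = sqlite3.connect(":memory:")
con_mem.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con_mem.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [("Atlanta", "Georgia", 1.25, 6), ("Tallahassee", "Florida", 2.6, 3)],
)
con_mem.commit()

# The ? placeholder is filled from params, which avoids building SQL by hand.
big_d = pd.read_sql_query("SELECT a, d FROM test WHERE d > ?", con_mem, params=(4,))
con_mem.close()
```

Because the database is created fresh each time, rerunning the cell does not pile up repeated rows the way rerunning inserts against a file on disk does.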


### Data Cleaning and Preparation 
A significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Usually data are stored in files or databases.

## Handling Missing Data

In [29]:
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [30]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [31]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [32]:
string_data[0]= None

In [33]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
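
As the output above suggests, pandas treats both `None` and `NaN` as missing. A short sketch of counting and filtering on that mask follows. The Series `strings` and the other names are made up for illustration.

```python
import numpy as np
import pandas as pd

strings = pd.Series(["aardvark", None, np.nan, "avocado"])

missing_mask = strings.isnull()       # True where the value is None or NaN
n_missing = int(missing_mask.sum())   # True counts as 1, so the sum is the number of gaps
present = strings[strings.notnull()]  # keep only the non-missing entries
```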

![image.png](attachment:image.png)

In [34]:
# Filtering out Missing Data

from numpy import nan as NA
data = pd.Series([1,NA, 3.5,NA, 7])

In [35]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [36]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [37]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

In [38]:
cleaned = data.dropna()

In [39]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [40]:
cleaned # By default, dropna drops every row that contains a missing value

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [41]:
data.dropna(how='all') # Drops only the rows in which all values are missing

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [42]:
data[4] = NA

In [43]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [44]:
data.dropna(axis = 1, how = 'all') # Drops only the columns in which all values are missing

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [45]:
df = pd.DataFrame(np.random.randn(7,3))

In [46]:
df.iloc[:4,1] = NA

In [47]:
df

Unnamed: 0,0,1,2
0,-2.344613,,1.617325
1,-1.221907,,1.182014
2,0.114244,,-1.025095
3,-1.131351,,-0.678256
4,0.766331,-0.12959,-0.068216
5,-0.836181,1.547009,-0.814914
6,-1.246731,-0.062556,0.316337


In [48]:
df.iloc[:2, 2] = NA

In [49]:
df

Unnamed: 0,0,1,2
0,-2.344613,,
1,-1.221907,,
2,0.114244,,-1.025095
3,-1.131351,,-0.678256
4,0.766331,-0.12959,-0.068216
5,-0.836181,1.547009,-0.814914
6,-1.246731,-0.062556,0.316337


In [50]:
df.dropna() # Drops every row that contains at least one NaN

Unnamed: 0,0,1,2
4,0.766331,-0.12959,-0.068216
5,-0.836181,1.547009,-0.814914
6,-1.246731,-0.062556,0.316337


In [51]:
df.dropna(thresh = 2) # Keeps only the rows that have at least 2 non-NaN values

Unnamed: 0,0,1,2
2,0.114244,,-1.025095
3,-1.131351,,-0.678256
4,0.766331,-0.12959,-0.068216
5,-0.836181,1.547009,-0.814914
6,-1.246731,-0.062556,0.316337
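
To make the `thresh` rule concrete, the sketch below counts the non-missing values in each row and compares that with what `dropna(thresh=2)` keeps. The small DataFrame `frame` is a made-up example, not the random data used above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.0, np.nan, np.nan],   # 1 non-NaN value
     [2.0, np.nan, 5.0],      # 2 non-NaN values
     [3.0, 4.0, 6.0]]         # 3 non-NaN values
)

# Number of observed (non-NaN) values per row.
non_missing_per_row = frame.notnull().sum(axis=1)

# Keep rows with at least 2 non-NaN values; row 0 is dropped.
kept = frame.dropna(thresh=2)
```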


In [52]:
# Filling Missing Data
df.fillna(0)

Unnamed: 0,0,1,2
0,-2.344613,0.0,0.0
1,-1.221907,0.0,0.0
2,0.114244,0.0,-1.025095
3,-1.131351,0.0,-0.678256
4,0.766331,-0.12959,-0.068216
5,-0.836181,1.547009,-0.814914
6,-1.246731,-0.062556,0.316337


In [53]:
df.fillna({1: 0.5, 2: 0}) # Fills NaN in column 1 with 0.5 and in column 2 with 0

Unnamed: 0,0,1,2
0,-2.344613,0.5,0.0
1,-1.221907,0.5,0.0
2,0.114244,0.5,-1.025095
3,-1.131351,0.5,-0.678256
4,0.766331,-0.12959,-0.068216
5,-0.836181,1.547009,-0.814914
6,-1.246731,-0.062556,0.316337


In [54]:
_ = df.fillna(0, inplace=True)  # modifies df in place; the call itself returns None

In [55]:
df

Unnamed: 0,0,1,2
0,-2.344613,0.0,0.0
1,-1.221907,0.0,0.0
2,0.114244,0.0,-1.025095
3,-1.131351,0.0,-0.678256
4,0.766331,-0.12959,-0.068216
5,-0.836181,1.547009,-0.814914
6,-1.246731,-0.062556,0.316337


In [56]:
# Another example
df = pd.DataFrame(np.random.randn(6,3))

In [57]:
df.iloc[2:,1] = NA

In [58]:
df

Unnamed: 0,0,1,2
0,0.473315,-0.801401,-0.399755
1,-1.276271,1.896402,-1.338874
2,1.161175,,0.299682
3,-0.669773,,-0.068948
4,2.867191,,-0.560318
5,1.048889,,-0.515697


In [59]:
df.iloc[4:,2] = NA

In [60]:
df

Unnamed: 0,0,1,2
0,0.473315,-0.801401,-0.399755
1,-1.276271,1.896402,-1.338874
2,1.161175,,0.299682
3,-0.669773,,-0.068948
4,2.867191,,
5,1.048889,,


In [61]:
df.ffill() # Forward fill, as in the previous sessions (same as fillna(method='ffill'), which newer pandas deprecates)

Unnamed: 0,0,1,2
0,0.473315,-0.801401,-0.399755
1,-1.276271,1.896402,-1.338874
2,1.161175,1.896402,0.299682
3,-0.669773,1.896402,-0.068948
4,2.867191,1.896402,-0.068948
5,1.048889,1.896402,-0.068948


In [62]:
# Forward fill, but fill at most 2 consecutive missing values per gap
df.ffill(limit=2)

Unnamed: 0,0,1,2
0,0.473315,-0.801401,-0.399755
1,-1.276271,1.896402,-1.338874
2,1.161175,1.896402,0.299682
3,-0.669773,1.896402,-0.068948
4,2.867191,,-0.068948
5,1.048889,,-0.068948


In [63]:
data = pd.Series([1 ,NA,3.5,NA,7])

In [64]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [65]:
# Filling the missing values with mean
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
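
The same idea extends to a DataFrame: passing the per-column means fills each column with its own mean. This is a minimal sketch on a made-up two-column frame, and the names `frame` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                      "y": [np.nan, 4.0, 8.0]})

# frame.mean() is a Series indexed by column name (NaN is skipped by default),
# so each column's gaps are filled with that column's own mean.
filled = frame.fillna(frame.mean())
```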

![image.png](attachment:image.png)

In [66]:
# Data Transformation - Removing Duplicates
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

In [67]:
data # Rows 5 and 6 hold the same (k1, k2) pair, so row 6 is a duplicate

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [68]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [69]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [70]:
data['v'] =  range(7)

In [71]:
data

Unnamed: 0,k1,k2,v
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [72]:
data.drop_duplicates(['k1']) # Filtering duplicates based only on the 'k1' column

Unnamed: 0,k1,k2,v
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' keeps the last one instead.

In [73]:
data.drop_duplicates(['k1','k2'], keep ='last')

Unnamed: 0,k1,k2,v
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
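
Besides `'first'` and `'last'`, `keep` also accepts `False`, which treats every copy of a repeated row as a duplicate. The sketch below uses a small made-up frame (`data_dup`) to show the difference.

```python
import pandas as pd

data_dup = pd.DataFrame({"k1": ["one", "two", "one", "two", "two"],
                         "k2": [1, 1, 2, 4, 4]})

# keep=False flags every row that has a twin, not just the later copies.
all_copies = data_dup[data_dup.duplicated(keep=False)]

# With keep=False, drop_duplicates removes all copies of any repeated row.
unique_only = data_dup.drop_duplicates(keep=False)
```

This is handy for inspecting the conflicting rows side by side before deciding which copy to keep.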


In [74]:
# Transforming Data Using a Function or Mapping
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon','pastrami', 'honey ham', 'nova lox']
                     ,'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [75]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [76]:
meat_from = {'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'}

In [77]:
# As you can see, some values are capitalized and some are not, so normalize the case first
lowercased = data['food'].str.lower()

In [78]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [79]:
data['animal'] = lowercased.map(meat_from)

In [80]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [81]:
# The same mapping expressed as a single function passed to map
data['food'].map(lambda x: meat_from[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
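
One difference between the two approaches shows up when a value has no entry in the mapping: mapping with a dict yields NaN, while a lambda that indexes the dict raises a KeyError. The sketch below shows the dict behavior with a made-up food (`"tofu"`) and a shortened lookup table. All names here are illustrative.

```python
import pandas as pd

foods = pd.Series(["Bacon", "pastrami", "tofu"])
meat_lookup = {"bacon": "pig", "pastrami": "cow"}  # no entry for "tofu"

# Mapping with a dict leaves NaN where the key is not found.
animals = foods.str.lower().map(meat_lookup)

# The gaps can then be labeled explicitly.
animals_filled = animals.fillna("unknown")
```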

## Replacing Values
Filling in missing data with the fillna method is a special case of more general value replacement. As you've already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Let's consider this Series:

In [82]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [83]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [84]:
data.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [85]:
# To replace several values at once, pass a list
data.replace([-999,-1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [86]:
# Or pass a dict to use a different replacement for each value
data.replace({-999: np.nan, -1000 : 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [87]:
# Renaming Axis Indexes
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [88]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [89]:
transform = lambda x : x[:4].upper()

In [90]:
transform

<function __main__.<lambda>(x)>

In [91]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [92]:
data.index = data.index.map(transform)

In [93]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [94]:
data.rename(index = str.title, columns = str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [95]:
# Or
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [96]:
# To modify a dataset in place, pass inplace=True
data.rename(index = {'OHIO':'INDIANA'}, inplace = True)

In [97]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [98]:
# Discretization and Binning

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

In [99]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [100]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [101]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [102]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False.

In [103]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
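
The effect of the closed side is easiest to see on a value that sits exactly on a bin edge. In this sketch the ages and bin edges are made up for illustration, and the same edges are used for both calls so only `right` changes.

```python
import pandas as pd

ages_edge = [25, 26]
bins_edge = [18, 25, 35]

# Default right=True: bins are (18, 25] and (25, 35], so 25 lands in the first bin.
right_closed = pd.cut(ages_edge, bins_edge)

# right=False: bins are [18, 25) and [25, 35), so 25 moves to the second bin.
left_closed = pd.cut(ages_edge, bins_edge, right=False)

codes_right = list(right_closed.codes)
codes_left = list(left_closed.codes)
```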

In [104]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [105]:
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

In [106]:
data = np.random.rand(20)

In [107]:
# The precision=2 option limits the decimal precision to two digits
pd.cut(data, 4, precision=2)

[(0.74, 0.97], (0.74, 0.97], (0.26, 0.5], (0.74, 0.97], (0.5, 0.74], ..., (0.74, 0.97], (0.26, 0.5], (0.019, 0.26], (0.26, 0.5], (0.5, 0.74]]
Length: 20
Categories (4, interval[float64]): [(0.019, 0.26] < (0.26, 0.5] < (0.5, 0.74] < (0.74, 0.97]]

In [108]:
data = np.random.randn(1000)

In [109]:
cats = pd.qcut(data, 4) # Cut into quartiles

In [110]:
cats

[(-0.647, 0.0163], (-0.647, 0.0163], (-2.731, -0.647], (0.0163, 0.712], (0.0163, 0.712], ..., (-0.647, 0.0163], (-0.647, 0.0163], (-2.731, -0.647], (0.712, 3.533], (-2.731, -0.647]]
Length: 1000
Categories (4, interval[float64]): [(-2.731, -0.647] < (-0.647, 0.0163] < (0.0163, 0.712] < (0.712, 3.533]]

In [111]:
pd.value_counts(cats)

(0.712, 3.533]      250
(0.0163, 0.712]     250
(-0.647, 0.0163]    250
(-2.731, -0.647]    250
dtype: int64

Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive).

In [112]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.294, 0.0163], (-1.294, 0.0163], (-1.294, 0.0163], (0.0163, 1.279], (0.0163, 1.279], ..., (-1.294, 0.0163], (-1.294, 0.0163], (-1.294, 0.0163], (0.0163, 1.279], (-2.731, -1.294]]
Length: 1000
Categories (4, interval[float64]): [(-2.731, -1.294] < (-1.294, 0.0163] < (0.0163, 1.279] < (1.279, 3.533]]
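
Like cut, qcut accepts a `labels` argument, which can make quantile bins easier to read. This is a small sketch on the integers 1 to 8 rather than the random data above. The label names are arbitrary.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 9)  # 1, 2, ..., 8

# Four equal-sized bins, each named instead of shown as an interval.
quartiles = pd.qcut(values, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Count the members of each bin, ordered by category.
counts = pd.Series(quartiles).value_counts().sort_index()
```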

In [113]:
# Detecting and Filtering Outliers
data = pd.DataFrame(np.random.randn(1000, 4))

In [114]:
data

Unnamed: 0,0,1,2,3
0,-1.224877,0.950989,1.214152,-0.937690
1,-0.199034,-0.854461,2.430224,0.147573
2,2.045437,0.234439,-0.481940,0.416740
3,0.586989,1.604809,0.239925,0.514013
4,0.501693,-0.603688,0.540451,1.753702
...,...,...,...,...
995,1.750143,-1.757571,-0.478223,-0.007695
996,1.202331,0.474920,0.448322,-0.592202
997,1.814061,0.559837,-0.586190,1.740230
998,1.394448,-0.930025,0.369897,1.094387


In [115]:
col = data[2]

In [116]:
col

0      1.214152
1      2.430224
2     -0.481940
3      0.239925
4      0.540451
         ...   
995   -0.478223
996    0.448322
997   -0.586190
998    0.369897
999   -0.824532
Name: 2, Length: 1000, dtype: float64

In [117]:
pd.value_counts(np.abs(col) > 3)

False    997
True       3
Name: 2, dtype: int64

In [118]:
col[np.abs(col) > 3] # Values in this column whose absolute value exceeds 3

55    -3.520768
290   -3.095489
954   -3.544035
Name: 2, dtype: float64

To select all rows having a value exceeding 3 or –3, you can use the any method on a Boolean DataFrame:

In [119]:
data[(np.abs(data) > 3).any(axis=1)] # Rows in which at least one value has an absolute value above 3

Unnamed: 0,0,1,2,3
55,0.213328,-1.351974,-3.520768,0.549863
209,-3.261149,-0.612673,-1.929849,0.502875
264,2.849744,3.028919,1.045825,-0.915662
285,0.260293,3.641603,0.244538,1.144504
290,-0.096579,-0.173146,-3.095489,-0.994526
299,-1.166162,3.527678,0.168135,0.553699
375,-0.271535,-0.769963,0.316418,3.523412
707,-0.36802,3.127179,1.048551,0.467242
737,3.019186,1.521593,0.391267,0.168081
779,-3.576781,-1.267646,-0.082716,-0.5841


In [120]:
pd.value_counts((np.abs(data) > 3).any(axis=1))

False    987
True      13
dtype: int64

In [121]:
data[np.abs(data) > 3] = np.sign(data) * 3

The statement np.sign(data) produces 1 and –1 values based on whether the values in data are positive or negative, so the assignment above caps every value to the interval –3 to 3.

In [122]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.047638,-0.015726,-0.071798,0.000681
std,1.008642,1.028094,1.018321,1.009458
min,-3.0,-3.0,-3.0,-2.958576
25%,-0.720742,-0.764458,-0.790327,-0.685187
50%,-0.040514,-0.032584,-0.091725,-0.005453
75%,0.637519,0.676998,0.594802,0.697198
max,3.0,3.0,2.994512,3.0
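
The capping assignment above can also be written with `DataFrame.clip`, which some readers find easier to follow. The sketch below checks that both give the same result on freshly generated data. The seed and the scaling by 2 (chosen so values beyond 3 are common) are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
frame = pd.DataFrame(rng.standard_normal((1000, 4)) * 2)

# Approach used above: overwrite outliers with +3 or -3 depending on their sign.
capped_sign = frame.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Alternative: clip limits every value to the interval [-3, 3] in one call.
capped_clip = frame.clip(lower=-3, upper=3)

same_result = np.allclose(capped_sign.to_numpy(), capped_clip.to_numpy())
```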


In [123]:
np.sign(data)*3

Unnamed: 0,0,1,2,3
0,-3.0,3.0,3.0,-3.0
1,-3.0,-3.0,3.0,3.0
2,3.0,3.0,-3.0,3.0
3,3.0,3.0,3.0,3.0
4,3.0,-3.0,3.0,3.0
...,...,...,...,...
995,3.0,-3.0,-3.0,-3.0
996,3.0,3.0,3.0,-3.0
997,3.0,3.0,-3.0,3.0
998,3.0,-3.0,3.0,3.0


In [124]:
# Permutation and Random Sampling
df = pd.DataFrame(np.arange(5*4).reshape(5,4))

In [125]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


Calling numpy.random.permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering.

In [126]:
sample = np.random.permutation(5)

In [127]:
sample

array([4, 1, 0, 2, 3])

In [128]:
df.take(sample) # As you can see the order changed

Unnamed: 0,0,1,2,3
4,16,17,18,19
1,4,5,6,7
0,0,1,2,3
2,8,9,10,11
3,12,13,14,15
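
If the shuffle needs to be reproducible, one option is a seeded NumPy generator. This is a sketch under that assumption: the seed value is arbitrary, and `frame`, `order`, `shuffled`, and `restored` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed; the same seed gives the same order
frame = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

# A permutation of the row positions 0..4.
order = rng.permutation(len(frame))

# take reorders the rows by position; the original index labels travel with them.
shuffled = frame.take(order)

# Sorting by index undoes the shuffle.
restored = shuffled.sort_index()
```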


In [129]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
4,16,17,18,19


In [130]:
choices = pd.Series([5,7,1,3,4])

In [131]:
draws = choices.sample(n=10, replace=True)

In [132]:
draws

3    3
1    7
2    1
4    4
1    7
3    3
0    5
2    1
0    5
0    5
dtype: int64
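Because the draws are random, rerunning the cell gives different results. A minimal sketch, assuming reproducibility is wanted, is to pass random_state (the seed 42 and the names draws_a and draws_b are arbitrary):

```python
import pandas as pd

choices = pd.Series([5, 7, 1, 3, 4])

# replace=True lets the same element be drawn more than once,
# which is why n can exceed len(choices)
draws_a = choices.sample(n=10, replace=True, random_state=42)
draws_b = choices.sample(n=10, replace=True, random_state=42)

print(draws_a.equals(draws_b))  # True: same seed, same draws
print(len(draws_a))             # 10
```

Without replace=True, asking for more draws than there are elements raises a ValueError.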

In [133]:
# Dummy Variables
df = pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

In [134]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [135]:
dummy = pd.get_dummies(df['key'], prefix = 'yoo')

In [136]:
dummy

Unnamed: 0,yoo_a,yoo_b,yoo_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [137]:
df_with_dummy = df[['data1']].join(dummy)

In [138]:
df_with_dummy

Unnamed: 0,data1,yoo_a,yoo_b,yoo_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
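The encode-then-join steps can also be done in one call by handing the whole frame to get_dummies. The sketch below assumes a prefix of key and requests integer output through dtype=int, since recent pandas releases return booleans by default.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# columns= names the column(s) to encode; the other columns pass through unchanged.
# dtype=int asks for 0/1 integers rather than True/False.
encoded = pd.get_dummies(df, columns=['key'], prefix='key', dtype=int)

print(encoded.columns.tolist())   # ['data1', 'key_a', 'key_b', 'key_c']
print(encoded['key_b'].tolist())  # [1, 1, 0, 0, 0, 1]
```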


## String Manipulation

In [139]:
Val = 'a,b,   guido'

In [140]:
Val.split(',')

['a', 'b', '   guido']

In [141]:
piece = [x.strip() for x in Val.split(',')]

In [142]:
piece

['a', 'b', 'guido']

In [143]:
first, second, third = piece
first + '::' + second + '::' + third

'a::b::guido'

In [144]:
# Or, more idiomatically, by passing the list to str.join
'::'.join(piece)

'a::b::guido'

In [145]:
'guido' in Val

True

In [146]:
Val.index(',')

1

In [147]:
Val.find(':')

-1
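The practical difference between find and index is how they report a missing substring. A short sketch (val is just a lowercase copy of the string used above):

```python
val = 'a,b,   guido'

print(val.find(':'))   # -1: find reports "not found" through its return value

try:
    val.index(':')     # index raises an exception instead
except ValueError as exc:
    print('ValueError:', exc)

print(val.count(','))  # 2: number of non-overlapping occurrences
```

Use find when a missing substring is an expected case, and index when it should be treated as an error.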

![image.png](attachment:image.png)

In [148]:
import re
text = "foo bar\t baz \tqux" 

In [149]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

In [150]:
regex = re.compile(r'\s+')

In [151]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [152]:
regex.findall(text)

[' ', '\t ', ' \t']

In [153]:
# Let's see another example
text = """Dave dave@google.com 
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [154]:
regex = re.compile(pattern, flags=re.IGNORECASE)

In [155]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [156]:
m = regex.search(text)

In [157]:
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [158]:
text[m.start():m.end()]

'dave@google.com'

In [159]:
text[m.start():]

'dave@google.com \nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'

In [160]:
print(regex.sub('', text))

Dave  
Steve 
Rob 
Ryan 

In [161]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [162]:
regex = re.compile(pattern, flags=re.IGNORECASE)

In [163]:
m = regex.match('wesm@bright.net')

In [164]:
m.groups()

('wesm', 'bright', 'net')

In [165]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [166]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com 
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
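Numbered references such as \1 get hard to follow as patterns grow. As a sketch, the same three-part pattern can be written with named groups; the group names username, domain and suffix are my own labels, not part of the original pattern.

```python
import re

# Same pattern as above, with each capture group given a (hypothetical) name
named = re.compile(
    r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})',
    flags=re.IGNORECASE,
)

m = named.match('wesm@bright.net')
print(m.groupdict())
# {'username': 'wesm', 'domain': 'bright', 'suffix': 'net'}

# In a replacement string, \g<name> refers to a named group
print(named.sub(r'\g<username> at \g<domain>', 'Rob rob@gmail.com'))
# Rob rob at gmail
```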

![image.png](attachment:image.png)

In [167]:
# Vectorized String Functions in pandas

In [168]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com','Rob': 'rob@gmail.com', 'Wes': np.nan}

In [169]:
data = pd.Series(data)

In [170]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [171]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [172]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [173]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [174]:
matches = data.str.match(pattern, flags=re.IGNORECASE)

In [175]:
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
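Two related string methods are often more convenient than findall and match. The sketch below uses str.extract to turn the capture groups into DataFrame columns and str.contains with na=False to build a clean boolean mask; the series name emails and the column labels are illustrative.

```python
import re
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; the missing entry becomes a row of missing values
parts = emails.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['username', 'domain', 'suffix']  # hypothetical labels

# na=False maps the missing entry to False so the result can be used as a mask
is_gmail = emails.str.contains('gmail', na=False)

print(parts)
print(emails[is_gmail])
```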

![image.png](attachment:image.png)
![image.png](attachment:image.png)

# Data Wrangling

In [176]:
# Hierarchical Indexing
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

In [177]:
data

a  1   -0.472796
   2    1.468871
   3    2.194933
b  1   -0.919549
   3   -1.616940
c  1    0.158249
   2   -0.029902
d  2    1.326508
   3   -0.804462
dtype: float64

In [178]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [179]:
data['b']

1   -0.919549
3   -1.616940
dtype: float64

In [180]:
# Slicing
data['b':'c']

b  1   -0.919549
   3   -1.616940
c  1    0.158249
   2   -0.029902
dtype: float64

In [181]:
data.loc[['b','d']]

b  1   -0.919549
   3   -1.616940
d  2    1.326508
   3   -0.804462
dtype: float64

In [182]:
data.loc[:,2]

a    1.468871
c   -0.029902
d    1.326508
dtype: float64

In [183]:
data.unstack()

Unnamed: 0,1,2,3
a,-0.472796,1.468871,2.194933
b,-0.919549,,-1.61694
c,0.158249,-0.029902,
d,,1.326508,-0.804462


In [184]:
data.unstack().stack()

a  1   -0.472796
   2    1.468871
   3    2.194933
b  1   -0.919549
   3   -1.616940
c  1    0.158249
   2   -0.029902
d  2    1.326508
   3   -0.804462
dtype: float64
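The round trip through unstack and stack can be checked on a small deterministic series. This is a sketch with made-up values; the explicit dropna() is included as an assumption so the result does not depend on whether the installed pandas version drops the gaps inside stack itself.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5.0),
              index=[['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 3, 2]])

wide = s.unstack()                   # inner level becomes the columns; gaps are NaN
print(wide.shape)                    # (3, 3)
print(int(wide.isna().sum().sum()))  # 4 cells had no value in s

# Stacking back and discarding the gaps restores the original labels and values
roundtrip = wide.stack().dropna()
print(roundtrip.index.tolist() == s.index.tolist())
print(roundtrip.tolist() == s.tolist())
```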

In [185]:
# Let us see another example
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])

In [186]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [187]:
frame.index.names = ['ked','ked2']

In [188]:
frame.columns.names = ['state','color']

In [189]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
ked,ked2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [190]:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
ked,ked2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


In [191]:
frame.swaplevel('ked','ked2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
ked2,ked,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [192]:
frame.sort_index(level = 1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
ked,ked2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [193]:
frame.swaplevel(0, 1).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
ked2,ked,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


In [194]:
# sum(level=...) is not available from pandas 2.0 on; group by the index level instead
frame.groupby(level='ked').sum()

state,Ohio,Ohio,Colorado
color,Green,Red,Green
ked,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,3,5,7
b,15,17,19


In [195]:
# Column levels are grouped by transposing, grouping on the level, and transposing back
frame.T.groupby(level='color').sum().T

Unnamed: 0_level_0,color,Green,Red
ked,ked2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


## Combining and Merging Datasets
Data contained in pandas objects can be combined together in a number of ways:

• pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.

• pandas.concat concatenates or “stacks” together objects along an axis.

• The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.
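Before going through each in detail, the three operations listed above can be sketched side by side on tiny inputs. All names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# merge: database-style join on the shared 'key' column (inner join by default)
joined = pd.merge(left, right, on='key')

# concat: stack the two frames along the row axis, keeping the union of columns
stacked = pd.concat([left, right], ignore_index=True)

# combine_first: fill the holes in one object with values from another
a = pd.Series([1.0, np.nan, 3.0], index=['p', 'q', 'r'])
b = pd.Series([10.0, 20.0, 30.0], index=['p', 'q', 'r'])
patched = a.combine_first(b)

print(joined)            # a single row: key 'b', x 2, y 3
print(stacked.shape)     # (4, 3)
print(patched.tolist())  # [1.0, 20.0, 3.0]
```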


In [196]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})

In [197]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [198]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


In [199]:
pd.merge(df1,df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [200]:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [201]:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})

In [202]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


In [203]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0
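With an outer join it is not always obvious which side a row came from. One way to see it, sketched here with the same two frames, is the indicator argument of merge, which adds a _merge column:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

print(outer[['key', '_merge']].drop_duplicates())
```

Here the keys a and b are marked both, c is left_only and d is right_only, which matches the missing values in the outer join result.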


![image.png](attachment:image.png)

In [204]:
# Many-to-many merge
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                             'data2': range(5)})

In [205]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [206]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


In [207]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0


In [208]:
pd.merge(df1, df2, how='inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2
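A many-to-many merge forms the Cartesian product of the matching rows, so each key contributes (rows on the left) times (rows on the right) rows to the result. The sketch below checks that arithmetic for the inner join; the helper names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

inner = pd.merge(df1, df2, on='key', how='inner')

# Expected rows per key = (count on the left) * (count on the right);
# keys present on only one side give NaN and are dropped
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()
expected = (left_counts * right_counts).dropna().astype(int)

print(expected.sort_index().to_dict())                      # {'a': 4, 'b': 6}
print(inner['key'].value_counts().sort_index().to_dict())   # {'a': 4, 'b': 6}
```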


In [209]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})

In [210]:
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

In [211]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0
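When the two frames share a column name that is not used as a join key, merge renames both copies. A sketch with the same left and right frames, joining on key1 only and choosing the suffixes _left and _right (my own choice, the defaults are _x and _y):

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# key2 exists on both sides but is not a join key here,
# so each copy gets a suffix to keep the column names unique
merged = pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

print(merged.columns.tolist())
# ['key1', 'key2_left', 'lval', 'key2_right', 'rval']
```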


![image.png](attachment:image.png)

In [212]:
# Concatenation stacks objects along an axis; unlike merging, it does not match rows on keys
arr = np.arange(12).reshape((3, 4))

In [213]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [214]:
np.concatenate([arr,arr], axis = 1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [215]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

In [216]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [217]:
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [218]:
s4 = pd.concat([s1, s3])

In [219]:
s4

a    0
b    1
f    5
g    6
dtype: int64

In [220]:
pd.concat([s1,s4], axis = 1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6
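By default concat keeps the union of the labels, which is where the missing values above come from. Two other options are sketched below: join='inner' keeps only the shared labels, and keys tags each input so the pieces stay identifiable. The key labels one and two are arbitrary.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])

# join='inner' keeps only the labels present in every input
inner = pd.concat([s1, s4], axis=1, join='inner')

# keys labels each input; along axis=0 this builds a hierarchical index
labelled = pd.concat([s1, s3], keys=['one', 'two'])

print(inner)
print(labelled.index.tolist())
# [('one', 'a'), ('one', 'b'), ('two', 'f'), ('two', 'g')]
```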


![image.png](attachment:image.png)
![image.png](attachment:image.png)

There are a few more concepts one has to master, such as the remaining options for combining and merging datasets, and reshaping and pivoting.