In [2]:
2+2

4

NumPy is an open-source Python library for scientific computing. It provides a host of features that let a Python programmer work with high-performance arrays and matrices.

NumPy and pandas are often used together: pandas relies heavily on the NumPy array to implement its data objects, shares many of NumPy's features, and builds upon functionality that NumPy provides. Both libraries belong to what is known as the SciPy stack, a set of Python libraries used for scientific computing.

NumPy arrays

1. NumPy allows you to work with high-performance arrays and matrices. 
2. Its main data object is the ndarray, an N-dimensional array type which describes a collection of “items” of the same type. For example:


In [11]:
import numpy as np

a1 = np.array([1, 2, 3, 4, 5]) #defining the ndarray

a1

array([1, 2, 3, 4, 5])

ndarrays are stored more efficiently than Python lists and allow mathematical operations to be vectorized, 
which results in significantly higher performance than with looping constructs in Python.
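
As a rough illustration of the vectorization point above, the sketch below computes the same element-wise square once with a Python loop and once with a single NumPy expression. The array size and the names `values`, `squares_loop`, and `squares_vec` are chosen here for illustration only; the actual speed difference depends on the machine and the array size.

```python
import numpy as np

values = np.arange(1_000, dtype=np.int64)

# Looping construct: one Python-level operation per element
squares_loop = [int(v) ** 2 for v in values]

# Vectorized: a single expression applied to the whole array at once
squares_vec = values ** 2

print(squares_vec[:5])
```

Both approaches give the same numbers; the vectorized form simply moves the loop out of Python.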

pandas Series Object

1. The Series is the primary building block of pandas. 
2. A Series represents a one-dimensional, labeled array based on the NumPy ndarray. Like an array, a Series can hold zero or more values of any single data type. A Series can be created and initialized by passing a scalar value, a NumPy ndarray, a Python list, or a Python dict as the data parameter of the Series constructor.
This is an example of defining a Series:

In [12]:
import pandas as pd

a2 = pd.Series([1, 2, 3, 4, 5])

a2

0    1
1    2
2    3
3    4
4    5
dtype: int64
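
The list case is shown above. As a small sketch of the scalar case mentioned in the description, the following passes a single value together with an explicit index; the labels `'x'`, `'y'`, `'z'` and the name `s_scalar` are made up for this example.

```python
import pandas as pd

# Scalar data: the value is repeated once for every label in the index
s_scalar = pd.Series(7, index=['x', 'y', 'z'])

print(s_scalar)
print(s_scalar['y'])  # access by label rather than by position
```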

Differences between ndarrays and Series Objects

There are some differences worth noting between ndarrays and Series objects.
- Elements of a NumPy array are accessed by their integer position, starting at zero for the first element.
    A pandas Series is more flexible: you can define your own labeled index and use it to access elements.
- Aligning data from different Series by matching labels is more efficient than doing the same with ndarrays, for example when dealing with missing values. If a label has no match during alignment, pandas returns NaN (Not a Number) for it so that the operation does not fail.
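
The two points above can be sketched with a pair of small Series whose labels only partly overlap. The labels and the names `s1`, `s2`, and `total` are illustrative assumptions, not taken from the cells that follow.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Label-based access through the user-defined index
print(s1['b'])

# Addition aligns on labels; 'a' and 'd' have no partner, so they become NaN
total = s1 + s2
print(total)
```

Only the labels present in both Series (`'b'` and `'c'`) get a numeric result; the operation itself does not fail.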

In [13]:
#Printing versions

print(pd.__version__)
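# show_versions() prints its report itself and returns None, so the outer print() adds the trailing "None"
print(pd.show_versions(as_json=True))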

0.24.2
{'system': {'commit': None, 'python': '3.7.3.final.0', 'python-bits': 64, 'OS': 'Linux', 'OS-release': '4.18.0-20-generic', 'machine': 'x86_64', 'processor': 'x86_64', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'en_IN', 'LOCALE': 'en_IN.ISO8859-1'}, 'dependencies': {'pandas': '0.24.2', 'pytest': '4.3.1', 'pip': '19.0.3', 'setuptools': '40.8.0', 'Cython': '0.29.6', 'numpy': '1.16.2', 'scipy': '1.2.1', 'pyarrow': None, 'xarray': None, 'IPython': '7.4.0', 'sphinx': '1.8.5', 'patsy': '0.5.1', 'dateutil': '2.8.0', 'pytz': '2018.9', 'blosc': None, 'bottleneck': '1.2.1', 'tables': '3.5.1', 'numexpr': '2.6.9', 'feather': None, 'matplotlib': '3.0.3', 'openpyxl': '2.6.1', 'xlrd': '1.2.0', 'xlwt': '1.3.0', 'xlsxwriter': '1.1.5', 'lxml.etree': '4.3.2', 'bs4': '4.7.1', 'html5lib': '1.0.1', 'sqlalchemy': '1.3.1', 'pymysql': None, 'psycopg2': None, 'jinja2': '2.10', 's3fs': None, 'fastparquet': None, 'pandas_gbq': None, 'pandas_datareader': None, 'gcsfs': None}}
None


In [23]:
# Creating python list
mylist = list('abcedfghijklmnopqrstuvwxyz')
print (mylist)
print (type(mylist))

['a', 'b', 'c', 'e', 'd', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
<class 'list'>


In [24]:
# Creating numpy ndarray
myarr = np.arange(26)
print (myarr)
print (type(myarr))

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25]
<class 'numpy.ndarray'>


In [25]:
# Creating python dictionary
mydict = dict(zip(mylist, myarr))
print(mydict)
print(type(mydict))

{'a': 0, 'b': 1, 'c': 2, 'e': 3, 'd': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25}
<class 'dict'>


In [26]:
# Creating a pandas Series from each of the above
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())

a    0
b    1
c    2
e    3
d    4
dtype: int64



pandas DataFrame

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns): data is aligned in a tabular fashion, and each column can hold a different data type.
A DataFrame consists of three principal components: the data, the rows, and the columns.
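
A minimal sketch of those three components, assuming a small made-up table (the column names, row labels, and the name `df_demo` are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'name': ['x', 'y', 'z'], 'score': [1.5, 2.0, 3.5], 'qty': [1, 2, 3]},
    index=['r1', 'r2', 'r3'],
)

print(df_demo.values.shape)      # the data, as a 2-D array
print(df_demo.index.tolist())    # the row labels
print(df_demo.columns.tolist())  # the column labels
print(df_demo.dtypes)            # heterogeneous: one dtype per column

# Size-mutable: a column can be added after construction
df_demo['flag'] = [True, False, True]
print(df_demo.shape)
```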

In [31]:
# Convert the index of a series into a column of a dataframe
mylist = list('abcedfghijklmnopqrstuvwxyz')
ser = pd.Series(mylist)
print(ser)
df1 = ser.to_frame()
print(df1.head())
df2 = ser.to_frame().reset_index()
print(df2.head())

0     a
1     b
2     c
3     e
4     d
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
15    p
16    q
17    r
18    s
19    t
20    u
21    v
22    w
23    x
24    y
25    z
dtype: object
   0
0  a
1  b
2  c
3  e
4  d
   index  0
0      0  a
1      1  b
2      2  c
3      3  e
4      4  d


In [39]:
# Set the name of a Series
ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
print (ser.name)
ser.name = 'alphabets'
print (ser.head())

None
0    a
1    b
2    c
3    e
4    d
Name: alphabets, dtype: object


In [40]:
# Get items of series A not present in B
# Input
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Solution
ser1[~ser1.isin(ser2)]

0    1
1    2
2    3
dtype: int64

In [41]:
# Get items not common to A and B

# Input
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Solution
ser_u = pd.Series(np.union1d(ser1, ser2))  # union
ser_i = pd.Series(np.intersect1d(ser1, ser2))  # intersect
ser_u[~ser_u.isin(ser_i)]

0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64

Sampling using Random data

https://stackoverflow.com/questions/45211624/what-exactly-does-the-pandas-random-state-do
https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.normal.html
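
As a short sketch of what the links above discuss: generators seeded with the same value produce the same draws, and pandas sampling accepts a seed through `random_state`. The seed value 100 mirrors the cell below; the names `draw_a`, `draw_b`, `ser_demo`, `sample_a`, and `sample_b` are illustrative.

```python
import numpy as np
import pandas as pd

# Two generators seeded identically produce identical draws
draw_a = np.random.RandomState(100).normal(10, 5, 5)  # mean, sd, size
draw_b = np.random.RandomState(100).normal(10, 5, 5)
print(np.array_equal(draw_a, draw_b))

# pandas takes the same kind of seed via the random_state parameter
ser_demo = pd.Series(range(10))
sample_a = ser_demo.sample(3, random_state=100)
sample_b = ser_demo.sample(3, random_state=100)
print(sample_a.equals(sample_b))
```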


In [42]:
# Get the minimum, 25th percentile, median, 75th, and max of a numeric series
state = np.random.RandomState(100)
ser = pd.Series(state.normal(10, 5, 25)) #mean, sd, size

# Solution
np.percentile(ser, q=[0, 25, 50, 75, 100])

array([ 1.25117263,  7.70986507, 10.92259345, 13.36360403, 18.0949083 ])

In [47]:
# Frequency count of unique items
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
#ser.head()
ser.value_counts()

b    6
g    6
c    5
h    4
e    3
d    3
a    2
f    1
dtype: int64

In [48]:
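# Keep the top 2 most frequent items and convert everything else to Other
# The draw is unseeded (constructing np.random.RandomState(100) without using it seeds nothing),
# so the values below change from run to run
ser = pd.Series(np.random.randint(1, 5, [12]))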

# Solution
print("Top 2 Freq:", ser.value_counts())
ser[~ser.isin(ser.value_counts().index[:2])] = 'Other'
ser

Top 2 Freq: 2    5
4    4
1    2
3    1
dtype: int64


0         4
1         4
2         4
3     Other
4     Other
5     Other
6         2
7         4
8         2
9         2
10        2
11        2
dtype: object

In [49]:
# Binning and Labeling

ser = pd.Series(np.random.random(20))
print(ser.head())

pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1], 
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head()

0    0.644479
1    0.817385
2    0.342707
3    0.686355
4    0.195269
dtype: float64


0    6th
1    8th
2    4th
3    7th
4    1st
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]

In [54]:
# Reshape the series ser into a dataframe with 7 rows and 5 columns

ser = pd.Series(np.random.randint(1, 10, 35))
print(ser.head(10))
df = pd.DataFrame(ser.values.reshape(7,5)) # works because 35 values divide evenly into 7 x 5
print(df)

0    7
1    1
2    1
3    1
4    3
5    9
6    9
7    4
8    5
9    9
dtype: int64
   0  1  2  3  4
0  7  1  1  1  3
1  9  9  4  5  9
2  7  6  8  5  3
3  8  2  3  3  3
4  1  4  6  1  1
5  1  4  3  2  5
6  5  4  5  5  7


In [58]:
# Find the positions of numbers that are multiples of 3 from ser

ser = pd.Series(np.random.randint(1, 10, 50))
print(ser.head(10))
print (np.argwhere(ser % 3==0))

0    7
1    8
2    9
3    8
4    1
5    6
6    5
7    3
8    2
9    8
dtype: int64
[[ 2]
 [ 5]
 [ 7]
 [14]
 [18]
 [19]
 [22]
 [24]
 [25]
 [33]
 [34]
 [40]
 [43]
 [44]
 [45]]


In [59]:
# Extract items at given positions from a series

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

ser.take(pos)

0     a
4     e
8     i
14    o
20    u
dtype: object

In [60]:
# Stack 2 series vertically and horizontally

# Input
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

# Vertical (Series.append was removed in pandas 2.0; pd.concat gives the same result)
df = pd.concat([ser1, ser2])
print(df)
# Horizontal
df = pd.concat([ser1, ser2], axis=1)
print(df)

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object
   0  1
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e


In [61]:
# Fetch position of items of Series A in B

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Solution 1
[np.where(i == ser1)[0].tolist()[0] for i in ser2]

# Solution 2
[pd.Index(ser1).get_loc(i) for i in ser2]

[5, 4, 0, 8]

In [62]:
# Calculate the mean squared error between truth and prediction

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

np.mean((truth-pred)**2)

0.4332419008555493

In [64]:
# Character count

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

freq = ser.map(lambda x: len(x))

df = pd.concat([ser, freq], axis=1)

print(df)

      0  1
0   how  3
1    to  2
2  kick  4
3  ass?  4


In [65]:
# Compute the difference of differences between consecutive numbers of a series
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

print(ser.diff().tolist())
print(ser.diff().diff().tolist())

[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]


In [66]:
# Convert a series of date-strings to a timeseries

# Input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Solution 1
from dateutil.parser import parse
ser.map(lambda x: parse(x))

# Solution 2
pd.to_datetime(ser)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

In [67]:
# Get day of month, week number, day of year, and day of week

# Input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Solution
from dateutil.parser import parse
ser_ts = ser.map(lambda x: parse(x))

# day of month
print("Date: ", ser_ts.dt.day.tolist())

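# week number (dt.weekofyear was removed in pandas 2.0; use dt.isocalendar().week there)
print("Week number: ", ser_ts.dt.weekofyear.tolist())

# day of year
print("Day number of year: ", ser_ts.dt.dayofyear.tolist())

# day of week (day_name() replaces the weekday_name attribute, which was removed in pandas 1.0)
print("Day of week: ", ser_ts.dt.day_name().tolist())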

Date:  [1, 2, 3, 4, 5, 6]
Week number:  [53, 5, 9, 14, 19, 23]
Day number of year:  [1, 33, 63, 94, 125, 157]
Day of week:  ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']


In [70]:
# Filter words that contain at least 2 vowels

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Solution
from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
print (mask)
print (ser[mask])

0     True
1     True
2    False
3    False
4     True
dtype: bool
0     Apple
1    Orange
4     Money
dtype: object


In [71]:
# Filter valid emails from a series

# Input
emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'

# Solution 1 (as series of strings)
import re
mask = emails.map(lambda x: bool(re.match(pattern, x)))
emails[mask]

# Solution 2 (as series of list)
emails.str.findall(pattern, flags=re.IGNORECASE)

# Solution 3 (as list)
[x[0] for x in [re.findall(pattern, email) for email in emails] if len(x) > 0]

['rameses@egypt.com', 'matt@t.co', 'narendra@modi.com']

In [76]:
# Mean of a series grouped by another series

fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))

print(fruit)
print(weights)

print (weights.groupby(fruit).count())
# Solution
weights.groupby(fruit).mean()

0     apple
1    carrot
2    banana
3    banana
4     apple
5    carrot
6     apple
7     apple
8     apple
9     apple
dtype: object
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9    10.0
dtype: float64
apple     6
banana    2
carrot    2
dtype: int64


apple     6.666667
banana    3.500000
carrot    4.000000
dtype: float64

In [77]:
# Fill an intermittent time series so all missing dates show up with values of previous non-missing date

ser = pd.Series([1,10,3, np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))

# Solution
ser.resample('D').ffill()  # fill with previous value

# Alternatives
ser.resample('D').bfill()  # fill with next value
ser.resample('D').bfill().ffill()  # fill next else prev value

2000-01-01     1.0
2000-01-02    10.0
2000-01-03    10.0
2000-01-04     3.0
2000-01-05     3.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     3.0
Freq: D, dtype: float64

In [79]:
# Compute auto correlation of a series
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
print(ser)

autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]
print(autocorrelations[1:])
print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:]))+1)

0     -1.050076
1     -2.480886
2      7.081634
3      8.405982
4      7.637698
5     18.075792
6      9.419959
7      0.237778
8     -9.511003
9     13.154110
10     7.406535
11    14.669242
12    -0.802335
13    13.975797
14    -5.294918
15    25.229852
16    17.376276
17    31.104219
18     3.292678
19    15.665627
dtype: float64
[-0.03, 0.3, -0.52, 0.33, -0.32, 0.6, -0.04, 0.08, -0.64, 0.24]
Lag having highest correlation:  9


In [85]:
# Read a dataset and create data frame
adver = pd.read_csv("Advertising.csv", usecols=[1, 2, 3, 4])
adver.head()

      TV  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9


In [86]:
# Change column values during a data set import 
df = pd.read_csv('Advertising.csv',
converters={'sales': lambda x: 'High' if float(x) > 20 else 'Low'})
df.head()

   Unnamed: 0     TV  radio  newspaper sales
0           1  230.1   37.8       69.2  High
1           2   44.5   39.3       45.1   Low
2           3   17.2   45.9       69.3   Low
3           4  151.5   41.3       58.5   Low
4           5  180.8   10.8       58.4   Low


In [90]:
# Find the row and column number of a cell that meets a given criterion
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
print (df.head())

# Solution
# Get Manufacturer with highest price
df.loc[df.Price == np.max(df.Price), ['Manufacturer', 'Model', 'Type']]

# Get Row and Column number
row, col = np.where(df.values == np.max(df.Price))
print (row, col)

  Manufacturer    Model     Type  Min.Price  Price  Max.Price  MPG.city  \
0        Acura  Integra    Small       12.9   15.9       18.8      25.0   
1          NaN   Legend  Midsize       29.2   33.9       38.7      18.0   
2         Audi       90  Compact       25.9   29.1       32.3      20.0   
3         Audi      100  Midsize        NaN   37.7       44.6      19.0   
4          BMW     535i  Midsize        NaN   30.0        NaN      22.0   

   MPG.highway             AirBags DriveTrain  ... Passengers  Length  \
0         31.0                None      Front  ...        5.0   177.0   
1         25.0  Driver & Passenger      Front  ...        5.0   195.0   
2         26.0         Driver only      Front  ...        5.0   180.0   
3         26.0  Driver & Passenger        NaN  ...        6.0   193.0   
4         30.0                 NaN       Rear  ...        4.0   186.0   

   Wheelbase  Width  Turn.circle Rear.seat.room  Luggage.room  Weight  \
0      102.0   68.0         37.0     

In [None]:
# Rename the column Type as CarType in df and replace the ‘.’ in column names with ‘_’.
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
# Step 1:
df=df.rename(columns = {'Type':'CarType'})
# or, by position (not recommended: it writes into the Index's underlying array in place)
df.columns.values[2] = "CarType"

# Step 2:
df.columns = df.columns.map(lambda x: x.replace('.', '_'))
print(df.columns)

In [91]:
# Input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df

    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19


In [6]:
# Linear Regression
X, y = adver.iloc[:, :-1], adver.iloc[:, -1]

import sklearn.model_selection as ms

X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.25, random_state=42)

In [7]:
import sklearn.linear_model as lm

In [8]:
regr = lm.LinearRegression()  # 1: create the estimator
regr.fit(X_train, y_train)    # 2: fit it on the training split
regr.score(X_test, y_test)    # 3: R^2 on the held-out test split

0.8935163320163657

In [10]:
import sklearn.svm as svm

svr = svm.LinearSVR(random_state=42)
svr.fit(X_train, y_train)
svr.score(X_test, y_test)

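# X_new is not defined anywhere in this notebook; it must hold new observations with the same columns as X
svr.predict(X_new)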



array([28.90830568, 24.77693452, 27.87546289])