# Pandas and Numpy Features for Data Cleaning and Manipulation
### This notebook will give an introduction to manipulating data with Pandas and Numpy

In [1]:
#Importing and naming the packages
import pandas as pd
import numpy as np

# Lists

In [2]:
#Lists - an ordered series of data (Integer, Float, Objects, etc.)
a = ['Hello','World']
print(a)

['Hello', 'World']


#### Lists can be changed in place with Python's built-in list methods.

In [3]:
#Via appending to the end
a.append('!')
print(a)

['Hello', 'World', '!']


In [4]:
#Or extending multiple items.
a.extend(['I','Am','A','List','.'])
print(a)

['Hello', 'World', '!', 'I', 'Am', 'A', 'List', '.']


In [5]:
#You can also remove an item
a.remove('World')
print(a)

['Hello', '!', 'I', 'Am', 'A', 'List', '.']


In [6]:
#Or insert a new item at a given position
a.insert(1,1) #Location, Value
print(a)

['Hello', 1, '!', 'I', 'Am', 'A', 'List', '.']


In [7]:
#Count the occurrences of a value
a.count('Hello')

1

In [8]:
#Sort by value. Strings and integers cannot be compared, so remove the integer first
a.remove(1)
a.sort()
print(a)

['!', '.', 'A', 'Am', 'Hello', 'I', 'List']


In [9]:
#Switch the order around
a.reverse()
print(a)

['List', 'I', 'Hello', 'Am', 'A', '.', '!']


In [10]:
#Remove and return a single item by position
a.pop(0)

'List'

In [11]:
#Or clear the entire thing
a.clear()
print(a)

[]


### Practice
#### Create an empty list. Add a set of five numbers to it. Then, iterate through that list and multiply each number by 5. Finally, print the list.

### Lambda Functions
Lambda functions will serve as the basis of the data generation we will be doing
in our regression and visualization talks.

Lambda functions let you define small anonymous functions for one-time use. `map` takes a function and one or more iterables
and applies the function to each item in turn. In Python 3, `map` returns a lazy map object rather than a list, so we cast
the result to a list to see the values.

To go over some of the syntax, a mapped lambda function looks like `map(lambda l1, l2: l1 + l2, L1, L2)`. This steps through
the lists L1 and L2 together and returns the sum of each pair of items. If the lists differ in length, `map` stops at the end
of the shorter one. Let's see.

In [13]:
x = [0,1,2,3,4,5]
lambda x: x**x

<function __main__.<lambda>(x)>

Note that this didn't compute anything; it only created a function object. As mentioned above, we must wrap it in a map function to apply it over an iterable.

In [14]:
print(map(lambda x: x**x, [0,1,2,3,4,5]))

<map object at 0x1146f5710>


So we can now iterate a function over a list, but we want to be able to return another list. Let's complete this with our cast to a list.

In [15]:
print(list(map(lambda x: x**x, [0,1,2,3,4,5])))

[1, 1, 4, 27, 256, 3125]
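
The two-list form described earlier works the same way. The following is a minimal sketch, not part of the original notebook; the names `L1`, `L2`, `sums`, and `shorter` and their values are illustrative only.

```python
# Two illustrative lists of equal length (made-up values)
L1 = [1, 2, 3]
L2 = [10, 20, 30]

# map() passes one element from each list to the lambda at every step
sums = list(map(lambda l1, l2: l1 + l2, L1, L2))
print(sums)  # [11, 22, 33]

# If the lists differ in length, map() stops at the end of the shorter one
shorter = list(map(lambda l1, l2: l1 + l2, [1, 2, 3], [10, 20]))
print(shorter)  # [11, 22]
```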


### Practice
#### Create two lists of five numbers. Then using a mapped lambda function, iterate through that list and multiply each number by the number in the same position of the other list. Finally sort the values and print them.

# Numpy
### Numpy arrays are similar. They are also an ordered series of items.

In [23]:
#They can be created directly from Lists
a = np.array([1, 2, 3, 4, 5])
print(a)

[1 2 3 4 5]


In [24]:
#And have multiple dimensions
a = np.array([[1,2],[3,4]])
print(a)

[[1 2]
 [3 4]]


In [30]:
#And have many more features
#Creating zeros
a = np.zeros(10, dtype=float)
print(a,': zeros\n')

#Creating ones
b = np.ones((3, 3), dtype=int)
print(b,': ones\n')

#For any other number
c = np.full((3, 3), 2.92)
print(c,': 2.92s\n')

#For a range of numbers (the stop value, 11, is excluded)
d = np.arange(1, 11)
print(d,': 1 - 11\n')

#For evenly spaced between two numbers
e = np.linspace(0, np.pi*2, 5)
print(e,': linearly spaced\n')

#For Random Numbers
np.random.seed(123)
f = np.random.random(5)
print(f,': random numbers\n')

#For Normal Random
g = np.random.randn(5)
print(g,': normal random\n')

#Random integers from 0 to 9 (the upper bound, 10, is excluded)
h = np.random.randint(0, 10, size=5)
print(h,': integer random')

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] : zeros

[[1 1 1]
 [1 1 1]
 [1 1 1]] : ones

[[2.92 2.92 2.92]
 [2.92 2.92 2.92]
 [2.92 2.92 2.92]] : 2.92s

[ 1  2  3  4  5  6  7  8  9 10] : 1 - 11

[0.         1.57079633 3.14159265 4.71238898 6.28318531] : linearly spaced

[0.69646919 0.28613933 0.22685145 0.55131477 0.71946897] : random numbers

[ 0.32210607 -0.05151772 -0.20420096  1.97934843 -1.61930007] : normal random

[9 3 4 0 0] : integer random


In [28]:
#You can also apply computations over entire arrays
a = np.array([1, 2, 3, 4, 5])
b = np.sin(a*np.pi/2)
print(b)

[ 1.0000000e+00  1.2246468e-16 -1.0000000e+00 -2.4492936e-16
  1.0000000e+00]
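
The values `1.2246468e-16` and `-2.4492936e-16` in the output above are floating-point rounding error; mathematically they are zero. Here is a small sketch of two common ways to deal with this, using `np.round` and `np.isclose`. The names `cleaned` and `expected` are illustrative, and the choice of 10 decimal places is arbitrary.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.sin(a * np.pi / 2)

# Round away the rounding error; adding 0.0 turns any -0.0 into 0.0
cleaned = np.round(b, 10) + 0.0
print(cleaned)

# Or compare against the expected values with a tolerance instead of ==
expected = np.array([1, 0, -1, 0, 1])
print(np.isclose(b, expected))
```

Comparing floats with a tolerance is usually safer than rounding, because it leaves the stored data untouched.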


### Practice
#### Create an array of the numbers 1 - 100 and apply a modulus of 3 to the numbers. Print the results.

# Pandas
### Pandas also has its own approach to list-like data with a label (automatically or manually applied) and a value

In [35]:
#Series from List
a = pd.Series([1, 3, 5, 7, 10])
print(a)

#Series of Single Value
b = pd.Series(1, index=['A', 'B', 'C'])
print(b)

#Series from Dictionary
c = pd.Series({1: 'A', 2: 'B', 3: 'C'})
print(c)

#Series of Random Integers
d = pd.Series(np.random.randint(10, size=5))
print(d)

0     1
1     3
2     5
3     7
4    10
dtype: int64
A    1
B    1
C    1
dtype: int64
1    A
2    B
3    C
dtype: object
0    4
1    1
2    7
3    3
4    2
dtype: int64


In [36]:
#Series can be calculated over without looping
b = a*2
print(b)

c = b%3
print(c)

0     2
1     6
2    10
3    14
4    20
dtype: int64
0    2
1    0
2    1
3    2
4    2
dtype: int64


In [37]:
#And also allow for quick checking of the index labels (note: `in` tests labels, not values)
a = pd.Series({'A':1,'B':2})
print('C' in a)
print('A' in a)

False
True
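
Because `in` looks at the index labels of a Series, checking for a stored value needs a different call. A brief sketch, assuming a Series like the one above (the name `s` is illustrative):

```python
import pandas as pd

s = pd.Series({'A': 1, 'B': 2})

# `in` tests the index labels, not the stored values
print('A' in s)        # True: 'A' is a label
print(1 in s)          # False: 1 is a value, not a label

# To test the values, check against .values or use isin()
print(1 in s.values)   # True
print(s.isin([2, 3]))  # boolean Series: False for A, True for B
```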


### More often though, your data will work better as a DataFrame. This is essentially the way Pandas represents tabular data, such as you would see in SQL or Excel.

In [38]:
#DataFrame from a List of Dictionaries
df = pd.DataFrame([{'A': i, 'B': 2*i} for i in range(3)]) 
print(df)

#DataFrame from Series
population_series = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860})
population_df = pd.DataFrame(population_series, columns=['Population'])
print(population_df)

#DataFrame from a Dictionary of Series (labels missing from one Series become NaN)
area_series = pd.Series({'California': 423967,
                  'Texas': 695662,
                  'New York': 141297,
                  'Florida': 170312,
                  'Illinois': 149995})
area_df = pd.DataFrame({'Population': population_series,
                  'area': area_series})
print(area_df)

   A  B
0  0  0
1  1  2
2  2  4
            Population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
            Population    area
California  38332521.0  423967
Florida     19552860.0  170312
Illinois           NaN  149995
New York    19651127.0  141297
Texas       26448193.0  695662
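
Illinois has no population entry, so Pandas fills that cell with NaN and the Population column becomes float. The sketch below shows one way to find and handle such gaps with `isna`, `dropna`, and `fillna`. It uses smaller, illustrative versions of the Series above, and filling with 0 is an arbitrary choice for demonstration.

```python
import pandas as pd

# Smaller, illustrative versions of the two Series above
population = pd.Series({'Texas': 26448193, 'Florida': 19552860})
area = pd.Series({'Texas': 695662, 'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'Population': population, 'area': area})

# Rows are aligned on the index; labels missing from one Series become NaN
print(states['Population'].isna())

# Drop rows with any missing value...
complete = states.dropna()
print(complete)

# ...or fill them with a placeholder (0 here is an arbitrary choice)
filled = states.fillna(0)
print(filled)
```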


In [39]:
#You can also very quickly create new columns and formulas
df['C'] = df['B']*2
print(df)

   A  B  C
0  0  0  0
1  1  2  4
2  2  4  8


In [41]:
#As well as dropping them (axis=1 means columns, axis=0 means rows); here we drop the row labeled 0
df = df.drop(0, axis=0)
print(df)

   A  B
1  1  2
2  2  4
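
A short sketch of `drop` in both directions, assuming a small DataFrame like the one built above (the names `no_c` and `no_first_row` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [0, 2, 4], 'C': [0, 4, 8]})

# axis=1 (or columns=...) removes a column by name
no_c = df.drop('C', axis=1)
print(no_c)

# axis=0 (or index=...) removes a row by its index label
no_first_row = df.drop(0, axis=0)
print(no_first_row)

# drop() returns a new DataFrame; the original is unchanged unless reassigned
print(df.shape)  # (3, 3)
```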


### Through these main features of Pandas, you can create nearly anything you want. Let's work with some larger data.

Data can be imported by reading a CSV or many other file types (Excel, JSON, SQL tables, and more). For example:

*data = pd.read_csv('~/Downloads/data.csv')*

This would read a CSV named data.csv from your Downloads folder on a Mac.
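
To try `pd.read_csv` without a file on disk, you can pass it a file-like object. This is only a sketch: the CSV text, the column names, and the variable `scores` are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents are made up for illustration
csv_text = "name,score\nAda,90\nGrace,85\n"

# pd.read_csv accepts a path or any file-like object
scores = pd.read_csv(io.StringIO(csv_text))
print(scores)

# Column types are inferred from the text
print(scores.dtypes)
```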

In our case, we will import some data from another package.

In [42]:
from sklearn import datasets

iris = datasets.load_iris()
data = iris.data
feature_names = iris.feature_names

data = pd.DataFrame(data,columns=feature_names)
data.head(5)
data.tail(5)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8


### Let's learn a little more about this data

In [43]:
#This will give quick summary statistics about our columns
data.describe()

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000


### Let's work on selecting data out of these 150 points

In [44]:
#Grabbing rows with a slice. Syntax: 'From:To(:Step)'. Each part is optional; an empty From or To means no limit in that direction
row = data[:1]
print(row)

rows = data[0:5]
print("\n",rows)

all_rows = data[0:]
print("\n All Rows:",len(all_rows))

half_rows = data[0::2]
print("\n Every Other Row:",len(half_rows))

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

 All Rows: 150

 Every Other Row: 75


In [46]:
#More complicated operations can also be performed
reverse_rows = data[::-1]
reverse_columns = data[data.columns[::-1]]
print(reverse_rows.head(5))
print(all_rows.tail(5))
print(reverse_columns.head(5))

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
149                5.9               3.0                5.1               1.8
148                6.2               3.4                5.4               2.3
147                6.5               3.0                5.2               2.0
146                6.3               2.5                5.0               1.9
145                6.7               3.0                5.2               2.3
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8
   petal width (cm)  petal length (cm)  sepal width (cm)  sepal 

### Data can also be selected more generally by using the loc and iloc indexers.
### First, in a Series

In [47]:
#Explicit Indexing
a = pd.Series([0.25, 0.5, 0.75, 1.0],
            index=['A', 'B', 'C', 'D'])
print(a)
print(a['A'])
print(a.loc['B'])

A    0.25
B    0.50
C    0.75
D    1.00
dtype: float64
0.25
0.5


In [48]:
#Implicit Indexing
a = pd.Series([0.25, 0.5, 0.75, 1.0])
print(a)
print(a[0])
print(a.iloc[1])

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
0.25
0.5


### Now in a DataFrame

In [50]:
#Choosing Data using loc, using the same Syntax as above.
a = pd.DataFrame(np.random.randint(10, size=10),
                index=[1,3,5,9,11,13,15,17,19,21])

print(a)

#Using Explicit Index (labels; the end label 5 is included)
print("Explicit\n",a.loc[:5])

#Using Implicit Index (positions; the end position 5 is excluded)
print("\nImplicit\n",a.iloc[:5])

    0
1   6
3   1
5   5
9   6
11  2
13  1
15  8
17  3
19  5
21  0
Explicit
    0
1  6
3  1
5  5

Implicit
     0
1   6
3   1
5   5
9   6
11  2
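
The difference between the two outputs above comes from how each indexer treats the end of a slice. A minimal sketch with an illustrative Series (the name `s` and its values are made up):

```python
import pandas as pd

# Illustrative Series with odd integer labels
s = pd.Series([10, 20, 30, 40], index=[1, 3, 5, 9])

# loc slices by label and includes the end label
print(s.loc[:5].tolist())   # [10, 20, 30]

# iloc slices by position and excludes the end position
print(s.iloc[:2].tolist())  # [10, 20]

# loc[5] is the value labelled 5; iloc[2] is the value at position 2
print(s.loc[5], s.iloc[2])  # 30 30
```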


### How about selecting data in higher dimensions?

In [51]:
#For the Iris data
rows = data.iloc[:5]
print('First five rows:\n',rows,'\n')

col_rows = data.iloc[:5,:2]
print('First five rows, first two columns:\n',col_rows)

First five rows:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2 

First five rows, first two columns:
    sepal length (cm)  sepal width (cm)
0                5.1               3.5
1                4.9               3.0
2                4.7               3.2
3                4.6               3.1
4                5.0               3.6


In [52]:
#Getting Exact Cells (positions are zero-based)
cell = data.iloc[2,2]
print('Third row, third column:\n',cell)

Third row, third column:
 1.3


### So how can you change this data?
#### Numpy

In [56]:
#Numpy
a = np.random.randint(20, size=(4,3))
print(a)

[[ 9  2  3]
 [11  3 19]
 [ 6  9 14]
 [ 6 19  6]]


In [57]:
#Slicing a Numpy array returns a view that links back to the original, so changing the subset will change the original
a_part = a[:2,:2]
a_part[0,0] = 2
print(a)

[[ 2  2  3]
 [11  3 19]
 [ 6  9 14]
 [ 6 19  6]]
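
If you want a subset that does not link back to the original array, take an explicit copy. A short sketch, using an illustrative array rather than the random one above; `np.shares_memory` reports whether two arrays overlap in memory.

```python
import numpy as np

a = np.arange(12).reshape(4, 3)

# Basic slicing returns a view that shares memory with the original
view = a[:2, :2]
print(np.shares_memory(a, view))   # True

# .copy() makes an independent array
independent = a[:2, :2].copy()
independent[0, 0] = 99
print(a[0, 0])                     # 0: the original is untouched

# Writing through the view does change the original
view[0, 0] = 99
print(a[0, 0])                     # 99
```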


#### Pandas

In [58]:
#Whether a Pandas subset is a view or a copy depends on the Pandas version, so do not rely on it.
#Taking an explicit copy guarantees that changes will not reach the original
a = pd.DataFrame(a)
a_part = a.iloc[:2,:2].copy()
a_part.iloc[0,0] = 4
print(a)

    0   1   2
0   2   2   3
1  11   3  19
2   6   9  14
3   6  19   6


In [59]:
#You can always change values in the original directly by assigning through iloc (or loc)
a.iloc[0,0] = 4
print(a)

    0   1   2
0   4   2   3
1  11   3  19
2   6   9  14
3   6  19   6


### How then can you check the shape of your data to help locate points?

In [60]:
#Number of dimensions
print("Iris Dimensions: ", data.ndim,"\n")

#(Rows,Columns)
print("Iris Shape: ", data.shape,"\n")

#Number of Elements
print("Iris Size: ", data.size,"\n")

#Data Types
print("Iris Types:\n", data.dtypes)

Iris Dimensions:  2 

Iris Shape:  (150, 4) 

Iris Size:  600 

Iris Types:
 sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
dtype: object


### Lastly, you can manipulate your data by adding Boolean restrictions.

#### Say you want to know how many of the flowers have above-average petal length.

In [63]:
#Start by grabbing out the data
lengths = pd.DataFrame(data.iloc[:,2])
lengths.head(5)
print(lengths.iloc[0])

petal length (cm)    1.4
Name: 0, dtype: float64


In [64]:
#Instead of:
above = 0
mean = lengths['petal length (cm)'].mean()
for i in range(len(lengths)):
    if lengths.iloc[i, 0] > mean:
        above += 1
above

93

In [83]:
#You can:
data[data['petal length (cm)'] > data['petal length (cm)'].mean()].shape[0]

93

In [74]:
#And then you can add more restrictions:
data[(data['petal length (cm)'] > 1) & (data['petal length (cm)'] < 2)].shape[0]

49
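
Boolean restrictions can also be combined with "or" and "not", and counted directly. The sketch below uses a small made-up DataFrame rather than the Iris data; the column names `length` and `width` are illustrative.

```python
import pandas as pd

# Small illustrative frame (made-up values, not the Iris data)
df = pd.DataFrame({'length': [1.0, 1.5, 4.0, 5.5],
                   'width':  [0.2, 0.3, 1.3, 2.0]})

# A comparison produces a boolean Series; True counts as 1 when summed
long_mask = df['length'] > 2
print(long_mask.sum())            # 2

# Combine masks with & (and), | (or), ~ (not); parentheses are required
either = df[(df['length'] < 1.2) | (df['width'] > 1.5)]
print(len(either))                # 2

# ~ inverts a mask, selecting the rows where it is False
not_long = df[~long_mask]
print(not_long['width'].mean())
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.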

## Some Practice:
### Tell me how many observations have above average petal length and width.

### Tell me what the average petal length is for those observations with petal width above average.