# Setup and get data

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn import datasets
import math
dir(sklearn.datasets)[15:20]  # peek at a few of the names exposed by sklearn.datasets
iris = sklearn.datasets.load_iris()
# convert to pandas df
iris = pd.DataFrame(np.concatenate((iris.data, np.array([iris.target]).T), axis=1), 
                    columns=iris.feature_names + ['target'])
# clean col names
iris.columns = [c.replace(' ', '_') for c in iris.columns]
iris.rename(columns={'sepal_length_(cm)': 'sepal_length', 
                     'sepal_width_(cm)': 'sepal_width', 
                     'petal_length_(cm)':  'petal_length',
                     'petal_width_(cm)': 'petal_width'}, inplace=True)

%run -i pandas_startup.py

pandas_startup() #set pandas options

In [2]:
iris[0:3]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0


# Sets

Sets are unordered and unindexed, and hold no duplicate values.
<br>Because they are hash-based, membership tests are much faster than on lists when there are lots of elements.
<br>Use them for membership testing, removing duplicates from a sequence, and [set ops like intersection, union, difference](https://levelup.gitconnected.com/python-sets-basics-and-usecases-af1fbe8906f4).

In [1]:
data = { "hello", "bye", 10, 15 }
data

{10, 15, 'bye', 'hello'}

Because sets are not indexed, you cannot fetch a value by position (`data[0]` raises a `TypeError`). Test for a value with `in`, or loop through the elements.
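A minimal sketch of the set use cases listed above (membership testing, de-duplication, set operations). The values are hypothetical and chosen only for illustration.

```python
# Hypothetical example values, not taken from the notebook's data.
visited = ["a", "b", "a", "c", "b"]

unique_visited = set(visited)   # duplicates collapse into one entry each
has_b = "b" in unique_visited   # membership test, no explicit loop needed

left = {1, 2, 3}
right = {3, 4}
both = left & right             # intersection
either = left | right           # union
only_left = left - right        # difference

print(unique_visited, has_b, both, either, only_left)
```

The printed order of set elements is not guaranteed, which is another consequence of sets being unordered.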

# Lists

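- The basic data structure: an ordered, mutable sequence
- Strings behave like sequences of characters (indexing and slicing work the same way), but unlike lists they are immutable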

`extend` adds the new elements to the same list; `append` adds the whole list as a single nested element (a sub-list).

In [21]:
primes = [2,3,5]
teen_primes=[11,13]
primes.extend(teen_primes) # extend keeps it in same overall list
print(primes)

[2, 3, 5, 11, 13]


In [22]:
primes.append(teen_primes) # append adds appended list as sub-list
print(primes)

[2, 3, 5, 11, 13, [11, 13]]


In [24]:
empty_list=[] # create an empty list
print(empty_list)

[]


List position of a value

In [39]:
primes.index(3)

1


### Subsetting lists

Named slices:

In [4]:
named_slice = slice(5, None)  # equivalent to [5:]

In [5]:
iris[named_slice]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
5,5.4,3.9,1.7,0.4,0.0
6,4.6,3.4,1.4,0.3,0.0
7,5.0,3.4,1.5,0.2,0.0
8,4.4,2.9,1.4,0.2,0.0
9,4.9,3.1,1.5,0.1,0.0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2.0
146,6.3,2.5,5.0,1.9,2.0
147,6.5,3.0,5.2,2.0,2.0
148,6.2,3.4,5.4,2.3,2.0
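The same named-slice idea works on a plain list. This is a small sketch with a hypothetical list of primes; `first_three` and `every_other` are illustrative names.

```python
# Hypothetical list used only to illustrate named slices.
primes = [2, 3, 5, 7, 11, 13, 17]

first_three = slice(0, 3)            # equivalent to [0:3]
every_other = slice(None, None, 2)   # equivalent to [::2]

head = primes[first_three]
alternating = primes[every_other]

print(head)
print(alternating)
```

Naming a slice is mostly useful when the same subset is applied in several places, since the intent is stated once.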


Lists don't support element-wise calculations: `[1, 2] * 2` repeats the list rather than doubling each value.
<br>Lists are flexibly typed (e.g. `[True, "2", 3.0, 4]`). That is flexible but slow. If the elements are all the same type, it is [faster to use a fixed-type array](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html).

## So use fixed-type numpy arrays.

[numpy cheatsheet](https://s3.amazonaws.com/dq-blog-files/numpy-cheat-sheet.pdf); ["basics"](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html)
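A short sketch of the difference between a list and a numpy array under arithmetic. The input values are hypothetical.

```python
import numpy as np

values = [1, 2, 3]             # hypothetical example data

doubled_list = values * 2      # list repetition, not arithmetic
arr = np.array(values)         # fixed-type array (an integer dtype here)
doubled_arr = arr * 2          # element-wise multiplication

print(doubled_list)
print(doubled_arr, arr.dtype)
```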

## Dictionaries

Key-value pairs. AKA maps, associative arrays, hash maps, and hash tables.
- Add the concept of "keys": indices that need not be numeric, used to access elements
- Ex. `{'Key1': Value1, 'Key2': Value2, 'Key3': Value3}`
- `{'Liz': [1, 2, 5], 'Alex': [3, 7, 8]}`
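A minimal sketch of key-based access, using the example dictionary from the list above. The `"Sam"` and `"Pat"` keys are hypothetical additions for illustration.

```python
scores = {"Liz": [1, 2, 5], "Alex": [3, 7, 8]}

liz_scores = scores["Liz"]        # look up a value by its key
scores["Sam"] = [4, 4, 9]         # add a new key-value pair (hypothetical entry)
has_alex = "Alex" in scores       # membership is tested against the keys
missing = scores.get("Pat", [])   # .get returns a default instead of raising KeyError

print(liz_scores, has_alex, missing, list(scores))
```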



# pandas extends numpy to explicitly label rows and columns 


- Numpy: implicit integer index: `titanic[2]`
- Pandas: explicit label: `titanic.Age` or `titanic['Age']`
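A small sketch of the two access styles. No `titanic` data is loaded in this notebook, so the `passengers` frame and its values below are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for a passenger table.
ages = np.array([22.0, 38.0, 26.0])
third_by_position = ages[2]          # numpy: implicit integer position

passengers = pd.DataFrame({"Age": [22.0, 38.0, 26.0],
                           "Fare": [7.25, 71.28, 7.92]})
age_attr = passengers.Age            # pandas: attribute-style label access
age_key = passengers["Age"]          # pandas: bracket-style label access

print(third_by_position)
print(age_attr.equals(age_key))
```

Bracket access is the safer habit, since attribute access fails for column names that contain spaces or collide with DataFrame methods.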



### Series 
are columns

Because index is explicit, can make it anything:

In [6]:
data = pd.Series([0.25, 0.5, 0.75],
                 index=[2, 5, "cat"])
data

2      0.25
5      0.50
cat    0.75
dtype: float64

So a Series is like a specialization of a Python dictionary (key-value pairs). A dictionary maps arbitrary keys to a set of arbitrary values, and a Series maps typed keys to a set of typed values.
<br>So any fixed-type dict is easy to convert to a Series:

In [17]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
dtype: int64

Dicts already use label indexing, just like Series:

In [18]:
print(population_dict['California'])
population['California']

38332521


38332521

But Series also support array-style operations such as slicing, which dicts do not:

In [21]:
population_dict['California':'Texas']

TypeError: unhashable type: 'slice'

In [22]:
population['California':'Texas']

California    38332521
Texas         26448193
dtype: int64
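One detail worth checking when slicing a Series: a label slice includes its end label, while a positional slice excludes its end position. A minimal sketch, rebuilding the `population` Series from above:

```python
import pandas as pd

population = pd.Series({"California": 38332521,
                        "Texas": 26448193,
                        "New York": 19651127})

by_label = population.loc["California":"Texas"]   # end label is included
by_position = population.iloc[0:1]                # end position is excluded

print(by_label)
print(by_position)
```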

### dataframes 
are collections of Series
<br>like a two-dimensional array with both flexible row indices and flexible column names
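A sketch of a DataFrame assembled from Series that share an index. The second column, `population_millions`, is a hypothetical derived column added only so that there are two Series to combine.

```python
import pandas as pd

population = pd.Series({"California": 38332521,
                        "Texas": 26448193,
                        "New York": 19651127})
population_millions = population / 1_000_000   # hypothetical derived Series

# Each Series becomes a column; the shared labels become the row index.
states = pd.DataFrame({"population": population,
                       "population_millions": population_millions})

one_column = states["population"]              # selecting a column gives back a Series
texas_pop = states.loc["Texas", "population"]  # row label and column name together

print(states)
```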

### pandas index object

In [23]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

Treat like array:

In [32]:
ind[2]

5

But an Index is immutable, so you can't change a single element:

In [35]:
iris.columns[0] = 'test'

TypeError: Index does not support mutable operations

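Use `df.rename` instead, which returns a renamed copy (or renames in place with `inplace=True`, as in the setup cell).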
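A minimal sketch of renaming a column without mutating the Index, using a small hypothetical frame rather than `iris`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})   # hypothetical frame

# Assigning to a single Index element fails.
try:
    df.columns[0] = "test"
    mutation_failed = False
except TypeError:
    mutation_failed = True

# rename returns a new frame and leaves the original untouched.
renamed = df.rename(columns={"a": "test"})

# Alternatively, replace the whole Index in one assignment.
df2 = df.copy()
df2.columns = ["test"] + list(df2.columns[1:])

print(mutation_failed, list(renamed.columns), list(df2.columns))
```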

**Row index**

In [25]:
iris.index

RangeIndex(start=0, stop=150, step=1)

In [27]:
for i in iris.index[0:5]:
    print(i)

0
1
2
3
4


**Column index**

In [28]:
iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'], dtype='object')

# Subsetting in pandas

### Return a Series
Series: a one-dimensional sequence of labeled data. It has an index and data/values, but NO columns.

In [44]:
iris['sepal_length'][0:2]

0    5.1
1    4.9
Name: sepal_length, dtype: float64

In [45]:
first_two = iris['sepal_length'][0:2]

A Series doesn't have columns, so you cannot select a column from it:

In [46]:
first_two['sepal_length']

KeyError: 'sepal_length'

In [18]:
iris["sepal_length"][0:2]

0    5.1
1    4.9
Name: sepal_length, dtype: float64

In [21]:
iris.sepal_length[0:2]

0    5.1
1    4.9
Name: sepal_length, dtype: float64

In [24]:
iris.loc[:, 'sepal_length'][0:2]

0    5.1
1    4.9
Name: sepal_length, dtype: float64

In [29]:
iris.iloc[:, 0][0:2]

0    5.1
1    4.9
Name: sepal_length, dtype: float64

In [31]:
iris.get('sepal_length')[0:2]

0    5.1
1    4.9
Name: sepal_length, dtype: float64

### Return a DataFrame with columns

In [48]:
iris[['sepal_length']][0:2]

Unnamed: 0,sepal_length
0,5.1
1,4.9


In [47]:
frame = iris[['sepal_length']][0:2]

A DataFrame has columns, so you can select a column from it:

In [49]:
frame['sepal_length']

0    5.1
1    4.9
Name: sepal_length, dtype: float64

In [32]:
iris[['sepal_length','sepal_width']][0:2]

Unnamed: 0,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
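A quick way to confirm which of the two you got back is to check the type and shape. This sketch uses a two-row stand-in frame built from the first two iris rows shown above, so it runs without the setup cell.

```python
import pandas as pd

# Stand-in for the first two rows of iris, so this runs on its own.
df = pd.DataFrame({"sepal_length": [5.1, 4.9],
                   "sepal_width": [3.5, 3.0]})

as_series = df["sepal_length"]     # single brackets: Series
as_frame = df[["sepal_length"]]    # double brackets: one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The double brackets pass a list of column names, which is why the result keeps its column dimension even with one name in the list.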
