# Pandas Series

- matplotlib, numpy, **pandas**, seaborn, scikit-learn, stats, scipy, ... 


## Some comparisons

______


- Numpy: Arrays
- Pandas: Series and Dataframes

___________ 

- Series: like a numpy array, but with some additional functionality. 


- Series: Imagine a single column of a table.  

- Dataframes: Imagine the entire table. 

______

How is a series different from a list? 

- A series has an index, which can be thought of as a row name (often just a row number) and gives you a way to reference items. The index is stored alongside other meta-information (information about the series).   

- The elements share a specific data type. The data type is inferred, but it can also be specified manually. 
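
A minimal sketch of both differences. The values, the fruit labels, and the names `price_list` and `prices` are made up for illustration:

```python
import pandas as pd

# A plain list only supports integer positions.
price_list = [1.5, 0.25, 3.0]

# A Series carries an index (here, custom string labels) and a single dtype.
prices = pd.Series(price_list, index=['apple', 'banana', 'cherry'])

print(prices.index.tolist())   # the row labels
print(prices.dtype)            # dtype inferred from the values
print(prices['banana'])        # look up a value by label instead of position
```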

_____ 



## Import Pandas

`import pandas as pd`

## Create a Series

1. from a list
2. from a numpy array
3. from a dataframe

**Create a Series from a List**

In a list, you use an index to access a value, and those indices are integers representing position. The indices of a list cannot be changed to a name, a datetime, etc. 

*But* we can create a series from the list, much as you would convert a list to an array with numpy, and the index of a series can be changed to those kinds of labels.
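
The cell that builds this series is not shown here. A minimal sketch, assuming `my_list` holds the three values described just below:

```python
import pandas as pd

# assumed contents, taken from the description of the series that follows
my_list = [2, 3, 5]

my_series = pd.Series(my_list)
print(my_series)
```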

what's inside the series? 

- 3 rows, with the row indices (or row names) as [0, 1, 2]
- the values are [2, 3, 5]
- the datatype is int64 (i.e. will store LARGE integers ;))


**Create a Series from an array**


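In [None]:
import numpy as np

# assumed values, matching the series described below
my_array = np.array([8.0, 13.0, 21.0])

# create series from array
my_series = pd.Series(my_array)

type(my_series)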

In [None]:
my_series

- 3 rows, with the row indices as [0, 1, 2]
- the values are [8.0, 13.0, 21.0]
- the datatype is float64

**Create a Series from a dataframe**


In [None]:
# data() is assumed to come from the pydataset package
from pydataset import data

# create the dataframe
my_df = data('sleepstudy')
my_df.head()

Create a Series from one column of the dataframe

**option 1: dot column_name**

`my_df.column_name`

In [None]:
my_series = my_df.Reaction
print(type(my_series))
# my_series

**option 2: single bracket**


`my_df[column_name]`

In [None]:
my_series = my_df['Reaction']
print(type(my_series))

**What happens when you do a double bracket?**

`my_df[[column_name]]`

In [None]:
my_dataframe_that_resembles_a_series = my_df[['Reaction']]
print(type(my_dataframe_that_resembles_a_series))
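
To see the difference without loading the dataset, here is a small sketch with a hypothetical two-row dataframe (`toy_df` and its values are made up):

```python
import pandas as pd

toy_df = pd.DataFrame({'Subject': [308, 309], 'Reaction': [249.56, 222.73]})

single = toy_df['Reaction']      # single brackets: one column as a Series
double = toy_df[['Reaction']]    # double brackets: a one-column DataFrame

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

The double brackets pass a list of column names, so the result stays two-dimensional even when the list has one entry.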

**See first 5 rows**

`my_series.head()`

In [None]:
my_series.head()

In [None]:
my_dataframe_that_resembles_a_series.head()

### Summary

Create series 

...from list, array, dictionary

In [None]:
myseries = pd.Series(my_list)
myseries

...from existing dataframe

In [None]:
myseries = my_df['Reaction']
myseries = my_df.Reaction  # equivalent to the line above
myseries

## Pandas data types

Data types you will see in series and dataframes: 

- int: integer, whole number values  
- float: decimal numbers  
- bool: true or false values  
- object: strings (or mixed types)  
- category: a fixed set of string values  

Separately from its data type, a series also has a name: an optional human-friendly label for the series.  

There are 2 ways a new Series gets its data type...

1. Inference of data type
2. Assignment of data type

_______________________________

1. Inference of data types: when creating the series, Pandas infers the data type based on the data entered. 

In [None]:
# inferring

pd.Series([True, False, True])

In [None]:
pd.Series(['I', 'Love', 'Codeup'])

In [None]:
pd.Series([1, 3, 'five'])

In [None]:
my_series = pd.Series([1, 3, 'five'])

my_new_series = my_series[my_series != 'five']

my_new_series

2. Assignment of data type: Using `astype()` to assign data types

In [None]:
# `astype()`

my_new_series.astype('int')

If we try to assign a data type to data that cannot be coerced into that type, we will get an error. 
For example, if we try to convert the series `my_series` to an integer type, pandas will raise an error because it can't convert 'five' to an int. 

In [None]:
# my_series.astype('int') 
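
A sketch of what that failure looks like and one way to catch it. The series is rebuilt here so the example stands alone, and the `converted` flag is only for illustration:

```python
import pandas as pd

my_series = pd.Series([1, 3, 'five'])

try:
    my_series.astype('int')
    converted = True
except ValueError as err:
    # 'five' cannot be parsed as an integer, so the whole conversion fails
    converted = False
    print('conversion failed:', err)
```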

In [None]:
# convert numeric type to string
subj_series = my_df['Subject'].astype('str')
subj_series

### Summary

- Pandas will infer datatypes
- You can change the datatype when creating the series, `pd.Series(mylist).astype('int')`, or later using `astype(x)`, where x can be 'float', 'int', 'str', e.g. `myseries.astype('str')`
- `astype('str')` will show the series dtype as object. 
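
Besides `astype()`, the `pd.Series` constructor accepts a `dtype` argument. A short sketch comparing the routes (the variable names are illustrative):

```python
import pandas as pd

inferred = pd.Series([2, 3, 5])                    # pandas infers an integer dtype
as_float = pd.Series([2, 3, 5], dtype='float64')   # dtype set at creation
later = inferred.astype('float64')                 # dtype changed afterwards

print(inferred.dtype, as_float.dtype, later.dtype)
```

Note that `astype()` returns a new series; `inferred` keeps its original dtype.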

_____________________________________
_____________________________________

## Vectorized Operations

Like numpy arrays, pandas series are vectorized by default. E.g., we can easily use the basic arithmetic operators to manipulate every element in the series.

First, let's create 2 series, s1 and s2:

In [None]:
s1 = pd.Series([2, 3, 5, 8])

s1.head()

In [None]:
s2 = my_df.Reaction
s2.head()

We will now perform arithmetic operations followed by comparison operations...

In [None]:
s1 + 1

In [None]:
s1/2

Now, comparison operations

In [None]:
s1 >= 5

In [None]:
(s1 >= 3) & (s1 % 2 == 0)

### Summary

- Just as in Numpy, we can operate on every element by applying the operation to the series itself: `s + 1`, `s/2`, `s == 3`, etc. The operation is evaluated for each element. 

- A series is always returned:
- a series of booleans if we give a condition. 
- a series of transformed values if we do an arithmetic operation. 
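
The points above, in one self-contained sketch (the series `s` mirrors `s1`; the other names are illustrative):

```python
import pandas as pd

s = pd.Series([2, 3, 5, 8])

doubled = s * 2                      # arithmetic: a series of transformed values
is_big = s >= 5                      # comparison: a series of booleans
big_or_even = is_big | (s % 2 == 0)  # combine conditions with | (or) and & (and)

print(doubled.tolist())
print(is_big.tolist())
print(big_or_even.tolist())
```

Each condition needs its own parentheses when combined, because `&` and `|` bind more tightly than the comparison operators.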


_______________________________________________________
_______________________________________________________

## Series Methods

### Methods to get glimpses into the series

`.head()`: returns the first 5 rows of the series (fewer if the series is shorter)

In [None]:
s1.head()

In [None]:
s2.head()

`.tail()`: returns the last 5 rows of the series

In [None]:
s2.tail()

`.value_counts()`: count number of records/items/rows containing each unique value (think "group by")

```sql
select Days, count(Subject) from my_df group by Days;
```

In [None]:
s3 = my_df.Days

s3.value_counts()

### Methods to test whether a value or condition exists in the series

`.any()`: returns a single boolean...do any values in the series meet the condition? 

In [None]:
(s1 > 3).any()

`.all()`: returns a single boolean...do all values in the series meet the condition? 

In [None]:
(s1 > 3).all()

`.isin()`: compares each item in the series to a list of values (strings, numbers, etc.). Is the item in your series found in the list? Returns a series of boolean values. 
In other words, use `isin()` to tell whether each value is in a set of known values. 

In [None]:
vowels = list('aeiouy')
letters = list('abcdefghijkeliminnow')
letters_series = pd.Series(letters)
letters_series

In [None]:
s3.value_counts()
s3

In [None]:
letters_series.isin(vowels).value_counts()
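
`isin()` is not limited to strings. A small sketch with made-up integers, which also shows `~` for inverting the result:

```python
import pandas as pd

days = pd.Series([0, 1, 2, 3, 4, 5])

in_set = days.isin([0, 5])   # True where the value is 0 or 5
not_in_set = ~in_set         # ~ flips each boolean

print(days[in_set].tolist())
print(days[not_in_set].tolist())
```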

### Methods for Descriptive statistics

- `count()`
- `sum()`
- `mean()`
- `median()`
- `min()`
- `max()`
- `mode()`
- `abs()`
- `std()`
- `quantile()`

In [None]:
{
    'count': s2.count(),
    'sum': s2.sum(),
    'mean': s2.mean(),
    'median': s2.median()
    
}

In [None]:
s2.describe()

In [None]:
print(s1)
s1.describe()
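
The cells above do not exercise `mode()`, `std()`, `quantile()`, or `abs()`. A brief sketch with made-up values:

```python
import pandas as pd

values = pd.Series([2, 3, 3, 5, 8])

print(values.mode().tolist())    # most frequent value(s); mode() returns a series
print(values.std())              # sample standard deviation
print(values.quantile(0.75))     # value at the 75th percentile

# abs() is element-wise: it returns a series rather than a single number
print(pd.Series([-2, 3, -5]).abs().tolist())
```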

### Applying other functions to each item in a series

1. Define the function: `def myfcn()`
2. Pass it to the `.apply` method, `series.apply(myfcn)`, or write the logic inline with a lambda: `series.apply(lambda n: <expression>)`

In [None]:
def even_or_odd(n):
    '''
    this function takes a number and returns a string indicating 
    whether the passed number is even or odd
    '''
    if n % 2 == 0:
        return 'even'
    else:
        return 'odd'

s1.apply(even_or_odd)

In [None]:
s1.apply(lambda n: 'even' if n % 2 == 0 else 'odd')

### Vectorized String Operations

We could do this: `series.apply(lambda s: s.lower())`. But why, when we have `series.str.lower()`?

Once we access the .str property, we can treat the resulting value as if it were a string. In our case, we called the .lower method, which will convert all the strings in the series to lower case.

In [None]:
s4 = pd.Series(['Madeleine', 'Misty', 'John', 'John', 'Ryan', 'Ryan', 'Adam', 'Adam', 'Margaret'])

In [None]:
# s4.apply(lambda s: s.lower())
s4.str.lower()

In [None]:
s4 = s4.str.replace('rgaret', 'ggie')

In [None]:
s4
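
A few more `.str` methods, sketched on a shortened copy of the names (the series `names` is illustrative):

```python
import pandas as pd

names = pd.Series(['Madeleine', 'Misty', 'John'])

print(names.str.len().tolist())            # length of each string
print(names.str.upper().tolist())          # upper-case each string
print(names.str.startswith('M').tolist())  # boolean series

# the boolean series can be used to filter the original
print(names[names.str.startswith('M')].tolist())
```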

### Summary

Methods look like functions, but you can tell they are methods because they are called on the series variable, followed by `.method_name()`, such as `my_series.count()`. Pandas functions, by contrast, are called on `pd`, such as `pd.concat([s1, s2])`. 

We talked about 

1. Methods to get glimpses into the series

2. Methods to test whether a value or condition exists in the series

3. Methods for Descriptive statistics

4. Methods for Applying other functions to each item in a series

5. Vectorized String Operations
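
The method-versus-function distinction from this summary, as a short sketch (`extra` and `combined` are illustrative names):

```python
import pandas as pd

s1 = pd.Series([2, 3, 5, 8])
extra = pd.Series([13, 21])

# method: called on the series itself
print(s1.count())

# function: called on the pandas module; concat expects a list of series
combined = pd.concat([s1, extra], ignore_index=True)
print(combined.tolist())
```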

__________________________________________
__________________________________________


## Subsetting & Indexing

In [None]:
s2.describe()
# let's divide by 60, treating this as a change from seconds to minutes for interpretability
# (note: the sleepstudy documentation lists Reaction in milliseconds)
s2_minutes = s2/60


# it's a series!
type(s2_minutes.describe())

Find the 75th percentile

In [None]:
s2_minutes.describe()['75%']

Find all values where the response time is above the 3rd quartile, using the cutoff between the 3rd and 4th quartiles identified above. 

In [None]:
q3 = s2_minutes.describe()['75%']

# subsetted our series to only those values that are in the 4th quartile
s2_minutes[s2_minutes > q3]

Create a new series that contains the label 'q4' if the value is above the 3rd quartile cutoff, or 'q1-q3' if not. 

In [None]:
s2_minutes.apply(lambda n: 'q4' if n > q3 else 'q1-q3')

Filter our letters series to keep only the vowels 

In [None]:
letters_series[letters_series.isin(vowels)]
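
The subsetting pattern used throughout this section is boolean masking. A self-contained sketch with hypothetical values, which also uses `quantile(0.75)` as a more direct way to get the cutoff that `describe()` reports as '75%':

```python
import pandas as pd

# hypothetical reaction values, only for illustration
times = pd.Series([3.2, 4.1, 4.4, 5.0, 5.3, 6.1, 6.9, 7.7])

q3 = times.quantile(0.75)       # same cutoff that describe() reports as '75%'
mask = times > q3               # boolean series, one entry per row
top_quarter = times[mask]       # keeps only rows where the mask is True

print(q3)
print(top_quarter)
```

The filtered series keeps the original index labels, so the rows can still be traced back to their source positions.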

## Numerical to Categorical Values - Binning & Cutting

`pd.cut()` puts numerical values into discrete bins. 

We can let pandas create bins of equal width, or we can specify the bin edges ourselves. 

In [None]:
reaction_bins_auto_series = pd.cut(s2_minutes, 4)

In [None]:
reaction_bins_auto_series.value_counts()

In [None]:
reaction_bins_series = pd.cut(s2_minutes, [3, 4, 5, 6, 7, 8])

In [None]:
reaction_bins_series.value_counts()
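
`pd.cut()` also accepts a `labels` argument to name the bins. A minimal sketch with made-up values and label names:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# bins are right-inclusive by default: (0, 2], (2, 4], (4, 6]
binned = pd.cut(values, bins=[0, 2, 4, 6], labels=['low', 'mid', 'high'])

print(binned.tolist())
print(binned.value_counts().sort_index())
```

Values that fall outside every bin come back as missing (NaN), so it is worth checking that the edges cover the full range of the data.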

## Plotting

In [None]:
import matplotlib.pyplot as plt
plt.hist(s2_minutes)
plt.show()

In [None]:
s2_minutes.plot()

In [None]:
s2_minutes.plot.hist()

In [None]:
reaction_bins_series.value_counts().plot.bar(color='slateblue', width=.9)
plt.title('Reaction Time Bins')