# Intro to Pandas
___

**pandas** provides data structures and data manipulation tools designed for fast & easy data cleaning and analysis.

pandas adopts array-based computing from NumPy, but is designed for **tabular** or **heterogeneous** data.

pandas has two primary data structures:

- **Series** - one-dimensional array-like object with a sequence of values having the same datatype
- **DataFrame** - rectangular table of data with ordered collection of columns, each of which can be a different value type
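As a rough illustration of the two structures, the sketch below builds one of each and prints their datatypes. The names `scores` and `results` and all values are made up for this example.

```python
import pandas as pd

# Series: one dimension, a single dtype shared by all values
scores = pd.Series([20.5, 22.0, 19.5])

# DataFrame: two dimensions, each column may have its own dtype
results = pd.DataFrame({
    "name": ["x", "y"],        # strings
    "score": [0.7, 10.0],      # floats
    "passed": [False, True],   # booleans
})

print(scores.dtype)    # one dtype for the whole Series
print(results.dtypes)  # one dtype per column
```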

In [1]:
# As always, we import pandas and numpy
import pandas as pd
import numpy as np

### Series

A Series has a sequence of values (all the same datatype) and an associated array of data labels called its **index**. If not specified otherwise, the index values are sequential integers starting at 0.

pandas can automatically infer the datatype of the values when a Series is created, but the datatype can also be specified explicitly.
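A minimal sketch of datatype inference versus an explicit datatype, using the `dtype` argument of `pd.Series`. The variable names are chosen for this example only.

```python
import pandas as pd

inferred = pd.Series([1, 2, 3])                   # all integers -> integer dtype
mixed = pd.Series([1, 2.5, 3])                    # int and float mixed -> float64
explicit = pd.Series([1, 2, 3], dtype="float64")  # dtype requested explicitly

print(inferred.dtype, mixed.dtype, explicit.dtype)
```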

A Series is like a fixed-length, ordered dict with a mapping of index values to data values.
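The dict analogy can be sketched as follows: membership tests and lookups go through the index labels, and a Series converts back to a plain dict. The `prices` data is hypothetical.

```python
import pandas as pd

prices = pd.Series({"apple": 1.5, "banana": 0.25, "cherry": 4.0})

print("banana" in prices)  # `in` checks the index labels, like dict keys
print(prices["cherry"])    # look up a value by its label
print(prices.to_dict())    # convert back to a plain Python dict
```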

The array representation and index object of a Series can be accessed via its **values** and **index** attributes.

Problem 1: Create a Series of 4 numerical values. Print the Series, its values, and its index.

In [6]:
# Create a Series of four integers; the index defaults to 0, 1, 2, 3
s1 = pd.Series([10, 20, 30, 40])
print("Problem 1 - series:")
print(s1)
print("values:", s1.values)
print("index:", s1.index)

Problem 1 - series:
0    10
1    20
2    30
3    40
dtype: int64
values: [10 20 30 40]
index: RangeIndex(start=0, stop=4, step=1)


Problem 2: Create a new Series with the same values but with string values for the index. Select and print out one of the values in the Series by its index label.

In [5]:
# Create a Series with the same values and an explicit string index
s2 = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
print("\nProblem 2 - series with string index:")
print(s2)
print("value at index 'c':", s2["c"])


Problem 2 - series with string index:
a    10
b    20
c    30
d    40
dtype: int64
value at index 'c': 30


Problem 3: You can also create a Series from a Python dict. Create a dict called `states_dict` with the state name as the key and the number as the value:

- 'Ohio': 35000
- 'Texas': 71000
- 'Oregon': 16000
- 'Utah': 5000

Use the dict `states_dict` to create a Series called `states_series`.

In [7]:
# Dict keys become the index labels; dict values become the Series values
states_dict = {
    "Ohio": 35000,
    "Texas": 71000,
    "Oregon": 16000,
    "Utah": 5000
}
states_series = pd.Series(states_dict)
print("\nProblem 3 - states_series:")
print(states_series)



Problem 3 - states_series:
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


Problem 4: Update the index of `states_series` to be the abbreviations of the state names. Do this in place.

In [33]:
states_series.index = ["OH", "TX", "OR", "UT"]
print("\nProblem 4 - states_series with abbreviations:")
states_series


Problem 4 - states_series with abbreviations:


OH    35000
TX    71000
OR    16000
UT     5000
dtype: int64

### DataFrame

A pandas DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type.

A DataFrame can be thought of as a dict of Series that all share the same index.

- DataFrames have both row and column indices
- DataFrames are physically 2D, but can represent higher-dimensional data using hierarchical indexing
- DataFrame rows are sometimes referred to as axis=0
- DataFrame columns are sometimes referred to as axis=1
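The points above can be sketched in a few lines: a DataFrame built from two Series sharing an index, reductions along each axis, and a small hierarchical (two-level) index. The `height`/`weight` data is made up; the state/year/pop values are borrowed from the example data used later in this notebook.

```python
import pandas as pd

# Two Series sharing the same index, combined like a dict of Series
height = pd.Series([1.6, 1.8], index=["ann", "bob"])
weight = pd.Series([55.0, 80.0], index=["ann", "bob"])
people = pd.DataFrame({"height": height, "weight": weight})

col_totals = people.sum(axis=0)  # reduce along the rows: one value per column
row_totals = people.sum(axis=1)  # reduce along the columns: one value per row

# Hierarchical index: (state, year) pairs label each row of a 2D table
multi = pd.DataFrame(
    {"pop": [1.5, 1.7, 2.4]},
    index=pd.MultiIndex.from_tuples(
        [("Ohio", 2000), ("Ohio", 2001), ("Nevada", 2001)],
        names=["state", "year"],
    ),
)
ohio = multi.loc["Ohio"]  # select by the outer level; rows are indexed by year

print(col_totals)
print(row_totals)
print(ohio)
```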


You can create DataFrames from dicts as well. 

In [29]:
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
df = pd.DataFrame(data)
print("\nOriginal df:")
df



Original df:


    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


Problem 5: Update the index of the DataFrame df to be the words for each number. i.e. 1 -> one 

In [31]:
df.index = ["one", "two", "three", "four", "five", "six"]
print("\nProblem 5 - df with word index:")
df



Problem 5 - df with word index:


        state  year  pop
one      Ohio  2000  1.5
two      Ohio  2001  1.7
three    Ohio  2002  3.6
four   Nevada  2001  2.4
five   Nevada  2002  2.9
six    Nevada  2003  3.2


Problem 6: Figure out two ways to access a column from df. Output the results. 

In [36]:
print("\nProblem 6 - access a column two ways:")
# Way 1: bracket notation works for any column name.
print(df["pop"])
# Way 2: attribute access. Note that `df.pop` would return the DataFrame.pop
# method rather than the column, so attribute access is shown on `year`.
print(df.year)



Problem 6 - access a column two ways:
one      1.5
two      1.7
three    3.6
four     2.4
five     2.9
six      3.2
Name: pop, dtype: float64
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64


Problem 7: Figure out two ways to access a row from df. Output the results. 

In [12]:
print("\nProblem 7 - access a row two ways (row 'three'):")
print(df.loc["three"])
print(df.iloc[2])


Problem 7 - access a row two ways (row 'three'):
state    Ohio
year     2002
pop       3.6
Name: three, dtype: object
state    Ohio
year     2002
pop       3.6
Name: three, dtype: object


Problem 8: Add a new column to your dataframe called 'rating' with the values `[5,4,3,2,1,0]`

In [13]:
df["rating"] = [5, 4, 3, 2, 1, 0]
print("\nProblem 8 - df with rating:")
print(df)


Problem 8 - df with rating:
        state  year  pop  rating
one      Ohio  2000  1.5       5
two      Ohio  2001  1.7       4
three    Ohio  2002  3.6       3
four   Nevada  2001  2.4       2
five   Nevada  2002  2.9       1
six    Nevada  2003  3.2       0


Problem 9: Create another column called `nonsense` that is the rating multiplied by the pop.  

In [14]:
df["nonsense"] = df["rating"] * df["pop"]
print("\nProblem 9 - df with nonsense:")
print(df)


Problem 9 - df with nonsense:
        state  year  pop  rating  nonsense
one      Ohio  2000  1.5       5       7.5
two      Ohio  2001  1.7       4       6.8
three    Ohio  2002  3.6       3      10.8
four   Nevada  2001  2.4       2       4.8
five   Nevada  2002  2.9       1       2.9
six    Nevada  2003  3.2       0       0.0


Problem 10: Create three Series using NumPy.

* series_numerical: The values 0-4 with index a-e
* series_zeros: All zeros with index a-e
* series_random: 5 random numbers with index a-e

Create a DataFrame called `numeric_df` from these three Series. Each Series will be a row and the columns will be a-e.

In [15]:
idx = list("abcde")

series_numerical = pd.Series(np.arange(5), index=idx)
series_zeros = pd.Series(np.zeros(5), index=idx)
series_random = pd.Series(np.random.random(5), index=idx)

numeric_df = pd.DataFrame([series_numerical, series_zeros, series_random],
                          index=["series_numerical", "series_zeros", "series_random"])
print("\nProblem 10 - numeric_df (series are rows):")
print(numeric_df)


Problem 10 - numeric_df (series are rows):
                         a         b         c         d         e
series_numerical  0.000000  1.000000  2.000000  3.000000  4.000000
series_zeros      0.000000  0.000000  0.000000  0.000000  0.000000
series_random     0.714632  0.684364  0.882814  0.062811  0.968047


Problem 11: Transpose the DataFrame `numeric_df` so that the series become columns instead of rows. Save this to a variable called `transposed_numeric_df`. Rename the columns to `numerical`, `zeros`, and `random`.

In [17]:
transposed_numeric_df = numeric_df.T
transposed_numeric_df.columns = ["numerical", "zeros", "random"]
print("\nProblem 11 - transposed_numeric_df:")
print(transposed_numeric_df)


Problem 11 - transposed_numeric_df:
   numerical  zeros    random
a        0.0    0.0  0.714632
b        1.0    0.0  0.684364
c        2.0    0.0  0.882814
d        3.0    0.0  0.062811
e        4.0    0.0  0.968047


Problem 12: Output `transposed_numeric_df` ordered by `random` descending.

In [18]:
print("\nProblem 12 - ordered by random desc:")
print(transposed_numeric_df.sort_values("random", ascending=False))


Problem 12 - ordered by random desc:
   numerical  zeros    random
e        4.0    0.0  0.968047
c        2.0    0.0  0.882814
a        0.0    0.0  0.714632
b        1.0    0.0  0.684364
d        3.0    0.0  0.062811


Problem 13: Select all rows from `transposed_numeric_df` where the column `numerical` is greater than 2.  

In [19]:
print("\nProblem 13 - numerical > 2:")
print(transposed_numeric_df[transposed_numeric_df["numerical"] > 2])


Problem 13 - numerical > 2:
   numerical  zeros    random
d        3.0    0.0  0.062811
e        4.0    0.0  0.968047


Problem 14: Add a column called `random_5` that is the column random multiplied by 5. Select all rows from `transposed_numeric_df` where `numerical` is greater than `random_5`.

In [20]:
transposed_numeric_df["random_5"] = transposed_numeric_df["random"] * 5
print("\nProblem 14 - rows where numerical > random_5:")
print(transposed_numeric_df[transposed_numeric_df["numerical"] > transposed_numeric_df["random_5"]])


Problem 14 - rows where numerical > random_5:
   numerical  zeros    random  random_5
d        3.0    0.0  0.062811  0.314054


Problem 15: Add a column to `transposed_numeric_df` called `even` that is `True` when `numerical` is even and `False` when `numerical` is odd.

In [21]:
transposed_numeric_df["even"] = (transposed_numeric_df["numerical"] % 2 == 0)
print("\nProblem 15 - with even column:")
print(transposed_numeric_df)


Problem 15 - with even column:
   numerical  zeros    random  random_5   even
a        0.0    0.0  0.714632  3.573158   True
b        1.0    0.0  0.684364  3.421820  False
c        2.0    0.0  0.882814  4.414071   True
d        3.0    0.0  0.062811  0.314054  False
e        4.0    0.0  0.968047  4.840236   True


Problem 16: Add a column called `even_odd` that has the value `odd` when `numerical` is odd and `even` when `numerical` is even.

In [22]:
transposed_numeric_df["even_odd"] = np.where(transposed_numeric_df["even"], "even", "odd")
print("\nProblem 16 - with even_odd column:")
print(transposed_numeric_df)


Problem 16 - with even_odd column:
   numerical  zeros    random  random_5   even even_odd
a        0.0    0.0  0.714632  3.573158   True     even
b        1.0    0.0  0.684364  3.421820  False      odd
c        2.0    0.0  0.882814  4.414071   True     even
d        3.0    0.0  0.062811  0.314054  False      odd
e        4.0    0.0  0.968047  4.840236   True     even


Problem 17: Print out the sum of all columns.

In [23]:
print("\nProblem 17 - sum of all columns:")
print(transposed_numeric_df.sum(numeric_only=True))


Problem 17 - sum of all columns:
numerical    10.000000
zeros         0.000000
random        3.312668
random_5     16.563340
even          3.000000
dtype: float64


Problem 18: Print out the index of the row with the max value for each column, i.e. for `numerical` it will be the last row `e` because it has a value of 4.

In [None]:
print("\nProblem 18 - index of max value for each column:")
print(transposed_numeric_df.idxmax(numeric_only=True))