
In [None]:
import pandas as pd
import numpy as np

## Pandas Series Cheat Sheet

This cheat sheet summarizes the key Pandas Series operations demonstrated in this notebook.

### Assignment 1: Series Basics

*   **Create a Series:**
    ```python
    pd.Series(data_array, name="series_name")
    ```
*   **Access Series Attributes:**
    *   `series.name`: Get the name of the Series.
    *   `series.dtype`: Get the data type of the Series values.
    *   `series.size`: Get the number of elements in the Series.
    *   `series.index`: Get the index of the Series.
*   **Calculate Mean:**
    ```python
    series.mean()
    ```
*   **Convert Data Type:**
    ```python
    series.astype("int") # Convert to integer type (decimals are truncated, not rounded)
    ```

### Assignment 2: Accessing Series Data

*   **Set Series Index:**
    ```python
    series.index = another_series_or_list_for_index
    ```
*   **Access by Integer Location (positional):**
    ```python
    series.iloc[start:end] # e.g., series.iloc[:10] for first 10
    ```
*   **Access by Label (index value):**
    ```python
    series.loc["label_start":"label_end"] # Inclusive slicing
    ```
*   **Reset Index:**
    ```python
    series.reset_index(drop=True) # Converts index to default integer index
    ```

### Assignment 3: Sorting and Filtering Series

*   **Sort by Values:**
    ```python
    series.sort_values(ascending=False) # Descending; ascending=True is the default
    ```
*   **Sort by Index:**
    ```python
    series.sort_index(ascending=False) # Descending; ascending=True is the default
    ```
*   **Filter using Boolean Mask:**
    ```python
    mask = (series.index.isin(list_of_labels)) & (series <= value)
    series.loc[mask]
    ```

### Assignment 4: Series Operations

*   **Element-wise Arithmetic:**
    ```python
    series.mul(1.1) # Multiply by 1.1
    series.add(2)   # Add 2
    # Or using operators:
    series * 1.1 + 2
    series - series.max()
    ```
*   **Get Max Value:**
    ```python
    series.max()
    ```
*   **Extract from String Index (using `.str` accessor):**
    ```python
    pd.Series(series.index).str[5:7].astype("int") # Extract month from 'YYYY-MM-DD'
    ```

### Assignment 5: Series Aggregations

*   **Sum of Values:**
    ```python
    series.sum()
    ```
*   **Mean of Values:**
    ```python
    series.mean()
    ```
*   **Count of Non-NA Values:**
    ```python
    series.count()
    ```
*   **Calculate Quantiles:**
    ```python
    series.quantile([0.1, 0.9]) # 10th and 90th percentiles
    ```
*   **Value Counts (and Normalize for Percentage):**
    ```python
    series.astype("int").value_counts(normalize=True)
    ```

### Assignment 6: Missing Data

*   **Introduce Missing Values (conditionally):**
    ```python
    series.where(~series.isin([value1, value2]), pd.NA)
    ```
*   **Count Missing Values:**
    ```python
    series.isna().sum()
    ```
*   **Fill Missing Values:**
    ```python
    series.fillna(series.median()) # Fill with median
    ```

### Assignment 7: Apply and Where

*   **Apply Custom Function:**
    ```python
    def custom_func(value, limit):
        # logic
        pass
    series.apply(custom_func, args=(limit_value,))
    # Using lambda:
    series.apply(lambda x: "Buy" if x < limit else "Wait")
    ```
*   **Conditional Operations with `.where()` (Pandas):**
    ```python
    series.where(condition, value_if_false)
    # Example: multiply by 0.9 if the index label is in dates_list, else by 1.1
    (series
     .where(series.index.isin(dates_list), series * 1.1)
     .where(~series.index.isin(dates_list), series * .9)
    )
    ```
*   **Conditional Operations with `np.where()` (NumPy):**
    ```python
    import numpy as np
    pd.Series(
        np.where(
            condition, # Boolean array
            value_if_true, # Result if condition is true
            value_if_false # Result if condition is false
        )
    )
    ```


# Assignment 1: Series Basics

The code has been provided to create an array, `oil_array`, from a DataFrame column.

* Convert `oil_array` into a Pandas Series, called `oil_series`. Give it a name!
* Return the name, dtype, size, and index of `oil_series`.

Take the mean of the Series' underlying values array.

Then, convert the series to integer datatype and recalculate the mean.
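As a rough sketch of these steps on made-up data (the values and the names `toy_array` and `toy_series` are illustrative assumptions, not taken from `oil.csv`), creating a Series and reading its attributes might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical prices, used only for illustration
toy_array = np.array([52.22, 51.44, 51.98, 52.01])

# Wrap the array in a named Series
toy_series = pd.Series(toy_array, name="toy_prices")

# Basic attributes
series_name = toy_series.name    # the name passed above
series_dtype = toy_series.dtype  # float64 for decimal data
series_size = toy_series.size    # number of elements
series_index = toy_series.index  # default RangeIndex

# Mean of the underlying NumPy values array
values_mean = toy_series.values.mean()
print(series_name, series_dtype, series_size, series_index, values_mean)
```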


In [None]:
# create a DataFrame from the oil file, drop missing values
oil = pd.read_csv("../retail/oil.csv").dropna()

# Grab 100 rows of oil prices
oil_array = np.array(oil["dcoilwtico"].iloc[1000:1100])

oil_array

array([52.22, 51.44, 51.98, 52.01, 52.82, 54.01, 53.8 , 53.75, 52.36,
       53.26, 53.77, 53.98, 51.95, 50.82, 52.19, 53.01, 52.36, 52.45,
       51.12, 51.39, 52.33, 52.77, 52.38, 52.14, 53.24, 53.18, 52.63,
       52.75, 53.9 , 53.55, 53.81, 53.01, 52.19, 52.37, 52.99, 53.84,
       52.96, 53.21, 53.11, 53.41, 53.41, 54.02, 53.61, 54.48, 53.99,
       54.04, 54.  , 53.82, 52.63, 53.33, 53.19, 52.68, 49.83, 48.75,
       48.05, 47.95, 47.24, 48.34, 48.3 , 48.34, 47.79, 47.02, 47.29,
       47.  , 47.3 , 47.02, 48.36, 49.47, 50.3 , 50.54, 50.25, 50.99,
       51.14, 51.69, 52.25, 53.06, 53.38, 53.12, 53.19, 52.62, 52.46,
       50.49, 50.26, 49.64, 48.9 , 49.22, 49.22, 48.96, 49.31, 48.83,
       47.65, 47.79, 45.55, 46.23, 46.46, 45.84, 47.28, 47.81, 47.83,
       48.86])

In [None]:
# convert oil_array to a series

oil_series = pd.Series(oil_array, name="oil_prices")

oil_series

0     52.22
1     51.44
2     51.98
3     52.01
4     52.82
      ...  
95    45.84
96    47.28
97    47.81
98    47.83
99    48.86
Name: oil_prices, Length: 100, dtype: float64

In [None]:
print(f"Name: {oil_series.name}")
print(f"dtype: {oil_series.dtype}")
print(f"size: {oil_series.size}")
print(f"index: {oil_series.index}")

Name: oil_prices
dtype: float64
size: 100
index: RangeIndex(start=0, stop=100, step=1)


In [None]:
oil_series.values.mean()

51.128299999999996

In [None]:
oil_series.index

RangeIndex(start=0, stop=100, step=1)

In [None]:
oil_series.index.dtype

dtype('int64')

In [None]:
oil_series.astype("int").values.mean()

50.66
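The mean drops from about 51.13 to 50.66 because `astype("int")` truncates the decimal part instead of rounding. A minimal sketch on hypothetical values (the name `toy` is an assumption) contrasts truncation with rounding before conversion:

```python
import pandas as pd

# Hypothetical prices, used only for illustration
toy = pd.Series([52.22, 51.98, 47.83], name="toy_prices")

# astype("int") drops the decimals: 51.98 becomes 51
truncated = toy.astype("int")

# Rounding first gives the nearest integer: 51.98 becomes 52
rounded = toy.round().astype("int")

print(truncated.tolist(), rounded.tolist())
```

If the nearest integer is what you want, rounding before the conversion is one option.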

# Assignment 2:  Accessing Series Data

* Set the date series, which has been created below, to be the index of the oil price series created in assignment 1.


* Then, take the mean of the first 10 and last 10 prices of the series.


* Finally, grab all oil prices from January 1st, 2017 to January 7th, 2017 (inclusive) and reset the index to the default integer index.
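The steps above can be sketched on a small made-up Series (the dates, prices, and variable names here are assumptions for illustration). Note that slicing with labels that are not in the index, such as "2017-01-01", relies on the index being sorted:

```python
import pandas as pd

# Hypothetical dates and prices, sorted by date
toy_dates = ["2016-12-29", "2016-12-30", "2017-01-03", "2017-01-04", "2017-01-09"]
toy = pd.Series([53.80, 53.75, 52.36, 53.26, 51.96], name="toy_prices")

# Replace the default integer index with the date strings
toy.index = toy_dates

# Positional slicing: first two and last two prices
first_two_mean = toy.iloc[:2].mean()
last_two_mean = toy.iloc[-2:].mean()

# Label slicing is inclusive of both endpoints;
# reset_index(drop=True) then discards the date labels
first_week = toy.loc["2017-01-01":"2017-01-07"].reset_index(drop=True)
print(first_two_mean, last_two_mean, first_week.tolist())
```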

In [None]:
dates = pd.Series(oil["date"]).iloc[1000:1100]

In [None]:
oil_series.index = dates

oil_series

date
2016-12-20    52.22
2016-12-21    51.44
2016-12-22    51.98
2016-12-23    52.01
2016-12-27    52.82
              ...  
2017-05-09    45.84
2017-05-10    47.28
2017-05-11    47.81
2017-05-12    47.83
2017-05-15    48.86
Name: oil_prices, Length: 100, dtype: float64

In [None]:
oil_series.iloc[:10]

date
2016-12-20    52.22
2016-12-21    51.44
2016-12-22    51.98
2016-12-23    52.01
2016-12-27    52.82
2016-12-28    54.01
2016-12-29    53.80
2016-12-30    53.75
2017-01-03    52.36
2017-01-04    53.26
Name: oil_prices, dtype: float64

In [None]:
# Mean of first 10 prices

oil_series.iloc[:10].mean()

52.765

In [None]:
# Mean of last 10 prices

oil_series.iloc[-10:].mean()

47.129999999999995

In [None]:
# Slice labels using loc, reset index and drop dates to return series w/ integer index

oil_series.loc["2017-01-01":"2017-01-07"].reset_index(drop=True)

0    52.36
1    53.26
2    53.77
3    53.98
Name: oil_prices, dtype: float64

# Assignment 3: Sorting and Filtering Series

* First, get the 10 lowest prices from the data.
* Sort the 10 lowest prices by date, starting with the most recent and ending with the oldest price.

* Finally, use the list of provided dates. Select only rows with these dates that had a price of less than 50 dollars per barrel.
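A minimal sketch of the same sorting and filtering pattern, using made-up prices and dates (all names and values below are assumptions for illustration):

```python
import pandas as pd

# Hypothetical prices indexed by date strings
toy = pd.Series(
    [52.22, 47.02, 49.83, 45.55, 53.98],
    index=["2016-12-20", "2017-03-21", "2017-02-08", "2017-05-04", "2017-01-06"],
    name="toy_prices",
)

# Three lowest prices: sort ascending by value, keep the first three rows
lowest_three = toy.sort_values().iloc[:3]

# Re-order those rows by date, most recent first
newest_first = lowest_three.sort_index(ascending=False)

# Keep rows whose date is in the list AND whose price is below 50
wanted = ["2017-03-21", "2017-01-06", "2017-04-15"]
mask = toy.index.isin(wanted) & (toy < 50)
cheap_on_wanted = toy.loc[mask]
print(newest_first.index.tolist(), cheap_on_wanted.to_dict())
```

`toy.nsmallest(3)` should return the same three rows as `toy.sort_values().iloc[:3]` and may read more directly.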

In [None]:
# list of dates to be used to solve bullet 3

dates = [
    "2016-12-22",
    "2017-05-03",
    "2017-01-06",
    "2017-03-05",
    "2017-02-12",
    "2017-03-21",
    "2017-04-14",
    "2017-04-15",
]

In [None]:
# Get 10 lowest prices by grabbing first 10 rows of sorted price series
# Then, sort by index in descending order

oil_series.sort_values().iloc[:10].sort_index(ascending=False)

date
2017-05-10    47.28
2017-05-09    45.84
2017-05-08    46.46
2017-05-05    46.23
2017-05-04    45.55
2017-03-27    47.02
2017-03-23    47.00
2017-03-22    47.29
2017-03-21    47.02
2017-03-14    47.24
Name: oil_prices, dtype: float64

In [None]:
# Create mask to filter to only dates in list of dates and oil price < 50

mask = oil_series.index.isin(dates) & (oil_series < 50)

oil_series.loc[mask]

date
2017-03-21    47.02
2017-05-03    47.79
Name: oil_prices, dtype: float64


# Assignment 4: Series Operations

* Increase the prices in the oil series by 10%, and add an additional 2 dollars per barrel on top of that.

* Then, create a series that represents the difference between each price and max price.

* Finally, extract the month from the string dates in the index and store them as an integer in their own series.
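These operations can be sketched on made-up data as follows (values and names are assumptions). The last line shows `pd.to_datetime` as an alternative to string slicing for extracting the month; it is not the approach the notebook uses:

```python
import pandas as pd

# Hypothetical prices indexed by 'YYYY-MM-DD' strings
toy = pd.Series(
    [50.0, 54.0, 48.0],
    index=["2016-12-20", "2017-01-06", "2017-05-04"],
    name="toy_prices",
)

# 10% increase plus 2 dollars; equivalent to toy.mul(1.1).add(2)
adjusted = toy * 1.1 + 2

# Difference between each price and the maximum price (zero or negative)
diff_from_max = toy - toy.max()

# Characters 5 and 6 of 'YYYY-MM-DD' hold the month
months = pd.Series(toy.index).str[5:7].astype("int")

# Alternative: parse the strings as dates and read the month attribute
months_via_datetime = pd.to_datetime(toy.index).month
print(adjusted.tolist(), diff_from_max.tolist(), months.tolist())
```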

In [None]:
# Multiply oil series values by 1.1 (10% increase), then add 2 to each row

# with Pandas methods
oil_series.mul(1.1).add(2)

# with Python operators
oil_series * 1.1 + 2

date
2016-12-20    59.442
2016-12-21    58.584
2016-12-22    59.178
2016-12-23    59.211
2016-12-27    60.102
               ...  
2017-05-09    52.424
2017-05-10    54.008
2017-05-11    54.591
2017-05-12    54.613
2017-05-15    55.746
Name: oil_prices, Length: 100, dtype: float64

In [None]:
# Get max price, store in variable

max_price = oil_series.max()

max_price

54.48

In [None]:
# Subtract max price from all rows in oil_series, then divide by max price
# to express each difference as a fraction of the max (returns a Series)
(oil_series - max_price) / max_price

date
2016-12-20   -0.041483
2016-12-21   -0.055800
2016-12-22   -0.045888
2016-12-23   -0.045338
2016-12-27   -0.030470
                ...   
2017-05-09   -0.158590
2017-05-10   -0.132159
2017-05-11   -0.122430
2017-05-12   -0.122063
2017-05-15   -0.103157
Name: oil_prices, Length: 100, dtype: float64

In [None]:
# Create a series from the index of oil_series
string_dates = pd.Series(oil_series.index)

In [None]:
# Slice out month portion of text string and convert to int
string_dates.str[5:7].astype("int")

0     12
1     12
2     12
3     12
4     12
      ..
95     5
96     5
97     5
98     5
99     5
Name: date, Length: 100, dtype: int64

In [None]:
# single line
pd.Series(oil_series.index).str[5:7].astype("int")

0     12
1     12
2     12
3     12
4     12
      ..
95     5
96     5
97     5
98     5
99     5
Name: date, Length: 100, dtype: int64

# Assignment 5: Series Aggregations

* Calculate the sum and mean of prices in the month of March.

* Next, calculate how many prices were recorded in January and February.

* Then, calculate the 10th and 90th percentiles across all data.

* Finally, how often did each integer dollar value (e.g. 51, 52) occur in the data? Normalize this to a percentage.
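The aggregations above can be sketched on a small made-up Series (all values and names are assumptions). Slicing characters 5 and 6 of each 'YYYY-MM-DD' label gives the two-digit month used for filtering:

```python
import pandas as pd

# Hypothetical prices indexed by date strings
toy = pd.Series(
    [52.36, 53.26, 53.40, 47.02, 47.70, 50.30],
    index=["2017-01-03", "2017-01-04", "2017-02-01",
           "2017-03-21", "2017-03-22", "2017-04-03"],
    name="toy_prices",
)

# Two-digit month for every index label
month = toy.index.str[5:7]

# Sum and mean of March prices
march = toy[month == "03"]
march_sum = march.sum()
march_mean = march.mean()

# Number of prices recorded in January and February
jan_feb_count = toy[month.isin(["01", "02"])].count()

# 10th and 90th percentiles across all prices
percentiles = toy.quantile([0.1, 0.9])

# Share of rows falling on each whole-dollar value
dollar_share = toy.astype("int").value_counts(normalize=True)
print(march_sum, march_mean, jan_feb_count)
```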

In [None]:
# Filter series to March (month "03"), calculate sum of prices, and round

oil_series[oil_series.index.str[5:7] == "03"].sum().round(2)

1134.54

In [None]:
# Filter series to March (month "03"), calculate mean

oil_series[oil_series.index.str[5:7] == "03"].mean()

49.32782608695651

In [None]:
# Filter series to Jan and Feb, count entries

oil_series[oil_series.index.str[5:7].isin(["01", "02"])].count()

39

In [None]:
# Calculate 10th and 90th percentiles of oil series using quantile

oil_series.quantile([0.1, 0.9])

0.1    47.299
0.9    53.811
Name: oil_prices, dtype: float64

In [None]:
# Return normalized value counts to get percentage of time each integer dollar value occurred

oil_series.astype("int").value_counts(normalize=True)

53    0.26
52    0.22
47    0.13
48    0.10
51    0.07
50    0.07
49    0.06
54    0.05
45    0.02
46    0.02
Name: oil_prices, dtype: float64

# Assignment 6: Missing Data

There were some erroneous prices in our data, so they were filled in with missing values.

Can you confirm the number of missing values in the price column?

Once you’ve done that, fill the missing prices with the median of the oil price series.
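A minimal sketch of this workflow on made-up data (values and names are assumptions). It uses `np.nan` as the replacement value, whereas the notebook passes `pd.NA`:

```python
import numpy as np
import pandas as pd

# Hypothetical prices, two of which are treated as erroneous
toy = pd.Series([52.22, 51.44, 51.98, 47.83, 48.86], name="toy_prices")

# Keep values that are NOT in the bad list; replace the rest with NaN
with_gaps = toy.where(~toy.isin([51.44, 47.83]), np.nan)

# isna() gives a boolean Series; summing it counts the missing values
n_missing = with_gaps.isna().sum()

# median() skips NaN by default, so it is computed from the valid prices only
filled = with_gaps.fillna(with_gaps.median())
print(n_missing, filled.tolist())
```

`fillna` returns a new Series, so assign the result back if the filled values should persist.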


In [None]:
# Replace two prices (51.44 and 47.83) with missing values
oil_series = oil_series.where(~oil_series.isin([51.44, 47.83]), pd.NA)

In [None]:
# Sum/count missing values

oil_series.isna().sum()

2

In [None]:
# Fill in missing values with median

oil_series.fillna(oil_series.median())

date
2016-12-20    52.220
2016-12-21    52.205
2016-12-22    51.980
2016-12-23    52.010
2016-12-27    52.820
               ...  
2017-05-09    45.840
2017-05-10    47.280
2017-05-11    47.810
2017-05-12    52.205
2017-05-15    48.860
Name: oil_prices, Length: 100, dtype: float64

# Assignment 7: Apply and Where

Write a function that returns ‘Buy’ if the price is less than the 90th percentile and ‘Wait’ if it’s not. Apply it to the oil series.

Then, create a series that multiplies the price by 0.9 if the date is ‘2016-12-23’ or ‘2017-05-10’, and by 1.1 for all other dates.
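Both tasks can be sketched on made-up data (values and the names `label_price`, `is_discount`, and `toy` are assumptions). The price adjustment here uses a single `.where()` call on the discounted prices, which is one alternative to chaining two `.where()` calls:

```python
import pandas as pd

# Hypothetical prices indexed by date strings
toy = pd.Series(
    [50.0, 60.0, 40.0],
    index=["2016-12-23", "2017-01-06", "2017-05-10"],
    name="toy_prices",
)

def label_price(price, limit):
    """Return 'Buy' when the price is below the limit, otherwise 'Wait'."""
    return "Buy" if price < limit else "Wait"

# Compute the 90th percentile once and pass it as an extra argument
limit = toy.quantile(0.9)
labels = toy.apply(label_price, args=(limit,))

# Start from the 0.9 prices, keep them on the listed dates,
# and use the 1.1 prices everywhere else
is_discount = toy.index.isin(["2016-12-23", "2017-05-10"])
adjusted = (toy * 0.9).where(is_discount, toy * 1.1)
print(labels.tolist(), adjusted.tolist())
```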

In [None]:
# Define a function that returns 'Buy' if price below limit, 'Wait' if not.

def buy_bool(price, limit):
    if price < limit:
        return "Buy"
    return "Wait"

In [None]:
# Apply function to oil series; use args to pass extra arguments (must be a tuple or list)

oil_series.apply(buy_bool, args=(oil_series.quantile(0.9),))

date
2016-12-20     Buy
2016-12-21    Wait
2016-12-22     Buy
2016-12-23     Buy
2016-12-27     Buy
              ... 
2017-05-09     Buy
2017-05-10     Buy
2017-05-11     Buy
2017-05-12    Wait
2017-05-15     Buy
Name: oil_prices, Length: 100, dtype: object
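In the output above, 2016-12-21 and 2017-05-12 are labelled "Wait" because their prices were replaced with missing values in Assignment 6, and any comparison with NaN evaluates to False. If missing prices should be labelled separately, one possible approach is sketched below (the "Missing" label, the fixed `limit`, and the data are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value
toy = pd.Series([50.0, np.nan, 60.0], name="toy_prices")
limit = 55.0  # hypothetical threshold

def label_price(price, limit):
    """Label missing prices explicitly before comparing against the limit."""
    if pd.isna(price):
        return "Missing"
    return "Buy" if price < limit else "Wait"

# Without the check, NaN < limit is False, so NaN falls through to 'Wait'
naive = toy.apply(lambda p: "Buy" if p < limit else "Wait")

# With the check, the missing price gets its own label
explicit = toy.apply(label_price, args=(limit,))
print(naive.tolist(), explicit.tolist())
```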

In [None]:
# Lambda function version of Wait/Buy
# Compute the limit once instead of recalculating the quantile for every row

limit = oil_series.quantile(0.9)
oil_series.apply(lambda x: "Buy" if x < limit else "Wait")

date
2016-12-20     Buy
2016-12-21    Wait
2016-12-22     Buy
2016-12-23     Buy
2016-12-27     Buy
              ... 
2017-05-09     Buy
2017-05-10     Buy
2017-05-11     Buy
2017-05-12    Wait
2017-05-15     Buy
Name: oil_prices, Length: 100, dtype: object

In [None]:
# Chain Pandas where to specify complementary logic.
# First where - if test returns FALSE (not one of these dates), multiply by 1.1
# Second where - if inverted test returns FALSE (is one of these dates) multiply by .9

(oil_series
 .where(oil_series.index.isin(["2016-12-23", "2017-05-10"]), oil_series * 1.1)
 .where(~oil_series.index.isin(["2016-12-23", "2017-05-10"]), oil_series * .9)
)

date
2016-12-20    57.442
2016-12-21       NaN
2016-12-22    57.178
2016-12-23    46.809
2016-12-27    58.102
               ...  
2017-05-09    50.424
2017-05-10    42.552
2017-05-11    52.591
2017-05-12       NaN
2017-05-15    53.746
Name: oil_prices, Length: 100, dtype: float64

In [None]:
# Use NumPy where to modify price based on dates.
# if date in list, multiply by .9
# if date not in list, multiply by 1.1
# Convert NumPy array returned by np.where to Series
import numpy as np

pd.Series(
    np.where(
        oil_series.index.isin(["2016-12-23", "2017-05-10"]),
        oil_series * 0.9,
        oil_series * 1.1,
    )
)

0     57.442
1        NaN
2     57.178
3     46.809
4     58.102
       ...  
95    50.424
96    42.552
97    52.591
98       NaN
99    53.746
Length: 100, dtype: float64
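Because `np.where` returns a plain NumPy array, the resulting Series above has a default integer index and no name. A minimal sketch on made-up data (values and names are assumptions) shows how the original labels could be carried over by passing `index` and `name` to `pd.Series`:

```python
import numpy as np
import pandas as pd

# Hypothetical prices indexed by date strings
toy = pd.Series(
    [50.0, 60.0, 40.0],
    index=["2016-12-23", "2017-01-06", "2017-05-10"],
    name="toy_prices",
)
is_discount = toy.index.isin(["2016-12-23", "2017-05-10"])

# Wrapping the raw array alone loses the date index and the name
raw = pd.Series(np.where(is_discount, toy * 0.9, toy * 1.1))

# Passing index and name restores them
labelled = pd.Series(
    np.where(is_discount, toy * 0.9, toy * 1.1),
    index=toy.index,
    name=toy.name,
)
print(raw.index.tolist(), labelled.index.tolist())
```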