### Series

Create a series of 10 elements, random integers from 70-100, representing scores on a monthly exam. 

Set the index to be the month names, starting in September and ending in June. (If these
months don’t match the school year in your location, then feel free to make them more realistic.) With this series, answer the following questions:

- What is the student’s average test score for the entire year?
- What is the student’s average test score during the first half of the year (i.e., the first five
months)?
- What is the student’s average test score during the second half of the year?
- Did the student improve their performance in the second half? If so, then by how much?

In [None]:
import pandas as pd
import numpy as np

In [5]:
np.random.seed(0)

# Months List as index
months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()

# Creating a Series of 10 elements 
scores = pd.Series(np.random.randint(70, 100, 10), index=months)

print(f'Yearly Average --> {scores.mean()}')

# First Half Average
first_half_avg = scores['Sep':'Jan'].mean()

# Second Half Average
second_half_avg = scores['Feb':'Jun'].mean()

print(f'First half average --> {first_half_avg}')
print(f'Second Half Average --> {second_half_avg}')
print(f'Improvement --> {second_half_avg - first_half_avg:.2f}')

Yearly Average --> 81.6
First half average --> 80.2
Second Half Average --> 83.0
Improvement --> 2.80
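As a small sketch of the slicing used above, the two halves can be taken either by label or by position. The scores below are hypothetical fixed values (not the random ones generated earlier), chosen so the arithmetic is easy to check by hand.

```python
import pandas as pd

months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()

# Hypothetical fixed scores, so the result doesn't depend on a random seed
scores = pd.Series([80, 75, 90, 85, 70, 95, 88, 72, 99, 81], index=months)

# Label-based slices include the end label; positional slices exclude the end position
first_by_label = scores.loc['Sep':'Jan']
first_by_position = scores.iloc[:5]

first_half_avg = first_by_position.mean()
second_half_avg = scores.iloc[5:].mean()

print(f'First half --> {first_half_avg}, second half --> {second_half_avg}')
```

Positional slicing with `iloc` has the advantage that it keeps working if the month labels change.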


Retrieving both individual elements and slices from series is a critical skill when working with pandas.
 
- In which month did this student get their highest score? Note that there are at least two ways to accomplish this: You can sort the values, taking the largest one, or you can use a boolean ("mask") index to find those rows that match the value of s.max(), the highest value.
- What were this student’s five highest scores in the year?
- Round the student’s scores to the nearest 10. So a score of 82 would be rounded down to 80, but a score of 87 would be rounded up to 90.

In [14]:
# Finding the month in which the highest score occurred using boolean filtering
print(f'Highest Score --> {scores[scores == scores.max()]}')

Highest Score --> Jun    97
dtype: int32
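The two approaches mentioned in the exercise (sorting, and a boolean mask) can be compared side by side. This is a minimal sketch on hypothetical scores; `idxmax` is a third option that pandas provides for exactly this question.

```python
import pandas as pd

# Hypothetical scores for four months
scores = pd.Series([80, 97, 75, 90], index=['Sep', 'Oct', 'Nov', 'Dec'])

# 1. Sort in descending order and take the first index label
by_sorting = scores.sort_values(ascending=False).index[0]

# 2. Use a boolean mask; this keeps every month that ties for the maximum
by_mask = scores[scores == scores.max()].index.tolist()

# 3. Ask pandas directly for the label of the (first) maximum
by_idxmax = scores.idxmax()

print(by_sorting, by_mask, by_idxmax)
```

The mask version is the only one of the three that reports all months when several share the top score.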


In [20]:
# Selecting the top 5 scores and rounding them to the nearest 10
def round_to_nearest_ten(number):
    return round(number, ndigits=-1)


scores.sort_values(ascending=False)[0:5].apply(round_to_nearest_ten)

Jun    100
May     90
Apr     90
Mar     80
Feb     80
dtype: int64
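Rounding can also be done on the whole series at once, without `apply`. The sketch below uses hypothetical scores. One caveat worth knowing: for a plain Python int, the built-in `round` sends exact halves to the nearest even multiple, so `round(85, ndigits=-1)` gives 80. If you want halves to always round up, integer arithmetic is one way to get that.

```python
import pandas as pd

# Hypothetical scores
scores = pd.Series([82, 87, 91, 78], index=['Sep', 'Oct', 'Nov', 'Dec'])

# Vectorized rounding: a negative number of decimals rounds to tens
rounded = scores.round(-1)

# Alternative that always rounds a trailing 5 upward:
# add 5, drop the ones digit with floor division, then scale back up
half_up = (scores + 5) // 10 * 10

print(rounded.tolist())
print(half_up.tolist())
```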

For this exercise, I want you to generate 12 test scores between 40 and 60, this time using an index
starting at Jan and ending with Dec. Find the mean of the scores, and add the difference
between the mean and 85 to each of the scores.

In [23]:
months = 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'.split()

student_scores = pd.Series(np.random.randint(40, 60, 12), index=months)

# Find the mean, then add the difference between 85 and the mean to each score
diff = 85 - student_scores.mean()

new_student_scores = student_scores + diff

new_student_scores.head()

Jan    94.75
Feb    94.75
Mar    89.75
Apr    82.75
May    75.75
dtype: float64
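Adding a scalar to a series is broadcast to every element, so shifting each score by the gap between 85 and the mean moves the mean to exactly 85. A minimal check on hypothetical scores:

```python
import pandas as pd

# Hypothetical scores whose mean is 50
scores = pd.Series([40, 50, 60], index=['Jan', 'Feb', 'Mar'])

diff = 85 - scores.mean()   # how far the mean is from 85
shifted = scores + diff     # the scalar is added to every element

print(shifted.tolist(), shifted.mean())
```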

In this exercise, I want you to generate 10 random integers in the range 0 - 100. (Remember that
the np.random.randint function returns numbers that include the lower bound but exclude the
upper bound.) Create a series containing those numbers' 10s digits. Thus, if our series contains
10, 20, 30, we want a series with 1, 2, 3.

In [27]:
rand_series = pd.Series(np.random.randint(0, 100, 10))

#Turn our series back into np.int8. 
# This has the result of removing 
# the fractional part of the number.
(rand_series/10).astype(np.int8)

0    8
1    7
2    6
3    0
4    6
5    4
6    0
7    7
8    5
9    7
dtype: int8
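An alternative sketch of the same idea uses floor division, which stays in integers and avoids the round trip through floats. The input values here are hypothetical, and the approach assumes numbers below 100 (for larger numbers you would also need `% 10`).

```python
import numpy as np
import pandas as pd

# Hypothetical values in the range 0-99
s = pd.Series([10, 20, 30, 7, 95])

# Floor division drops the ones digit directly
tens = s // 10

# Same result via true division followed by truncation to an integer dtype
via_division = (s / 10).astype(np.int8)

print(tens.tolist())
```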

In [48]:
# Create a new series, with 10 floating-point values between 0 and 1,000. Find the numbers
# whose integer component (i.e., ignoring any fractional part) are even

import math

float_df = pd.Series(np.random.random(size=10) * 1000)

float_df

0    660.173537
1    290.077607
2    618.015429
3    428.768701
4    135.474064
5    298.282326
6    569.964911
7    590.872761
8    574.325249
9    653.200820
dtype: float64

In [51]:
float_df[float_df.map(math.floor) % 2 == 0]
 

0    660.173537
1    290.077607
2    618.015429
3    428.768701
5    298.282326
7    590.872761
8    574.325249
dtype: float64
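The same filter can be written without `math.floor` by converting the series to integers, which truncates the fractional part. This is a sketch on hypothetical values; for non-negative numbers, truncation and flooring give the same result.

```python
import pandas as pd

# Hypothetical floating-point values
values = pd.Series([660.17, 135.47, 298.28, 569.96])

# astype(int) truncates toward zero, leaving only the integer component
int_part = values.astype(int)

# Keep the original floats whose integer component is even
even = values[int_part % 2 == 0]

print(even.tolist())
```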

### Working with Boolean Indexes

In Python and other traditional programming languages, we can select elements from a sequence using a combination of for loops and if statements.

While you could do that in pandas, you almost certainly don’t want to. Instead, you want to select items using a technique known as a "boolean index" or a "mask index."

In [28]:
new_series = pd.Series([10, 20, 30, 40, 50])

# To retrieve 40 from this series, let's use its index

new_series[3]

40

But instead of passing in a single integer, I can also pass a list (or NumPy array or series) of boolean values.



In [29]:
new_series[[False, False, False, True, False]]

3    40
dtype: int64

Notice the double square brackets! The outer pair indicates we want to retrieve from the series s (here, new_series). The inner pair defines a Python list.
With this mask, we get back a series containing only 40.

Notice that the list we used was of the same length as s, and that wherever we passed a True value, the value from s was returned. That’s why this is called a "mask index," because we’re using the list of booleans as a type of sieve, or mask, to select only certain elements.


An explicitly defined list of booleans isn’t very useful or common. But we can also use a series of booleans—and those are easy to create. All we need to do is use a comparison operator (e.g., ==) which returns a boolean value.
Then we can broadcast the operation, and get a series back.

In [37]:
# A comparison operator is broadcast across the series and returns a series of booleans

type(new_series < 35)

pandas.core.series.Series

In [35]:
# Now you can filter by this boolean series
new_series[new_series < 35]

0    10
1    20
2    30
dtype: int64

In [38]:
# Let's make this more sophisticated

new_series[new_series <= new_series.mean()]

0    10
1    20
2    30
dtype: int64

Now new_series appears three times in the expression: once when we calculate new_series.mean(), a second time when we compare the mean with each element of new_series via broadcast, and a third time when we apply the resulting boolean series to new_series.
We can thus see all of the elements that are less than or equal to the mean.
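The three steps described above can be pulled apart into separate variables, which makes the intermediate boolean series visible. A minimal sketch, using the same five values:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

mean = s.mean()     # step 1: a single scalar
mask = s <= mean    # step 2: the comparison is broadcast, giving a boolean series
selected = s[mask]  # step 3: the mask keeps only the rows marked True

print(mean)
print(mask.tolist())
print(selected.tolist())
```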

Finally, we can use a mask index for assignment, as well as retrieval. For example:

In [39]:
new_series[new_series <= new_series.mean()] = 99999

new_series

0    99999
1    99999
2    99999
3       40
4       50
dtype: int64
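Mask assignment changes the series in place. If you want to keep the original values, one option is to assign on a copy; another is the `mask` method, which returns a new series. The following is a sketch with a hypothetical cap of 30.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Option 1: copy first, then assign through a boolean mask
capped = s.copy()
capped.loc[capped > 30] = 30

# Option 2: Series.mask replaces values where the condition is True
# and returns a new series, leaving s untouched
capped_via_mask = s.mask(s > 30, 30)

print(s.tolist())
print(capped.tolist())
```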

### Descriptive Statistics

- Generate a series of 100,000 floats in a normal distribution, with a mean at 0 and a
standard deviation of 100.
- Get the descriptive statistics for this series. How close are the mean and median?
- Replace the minimum value with 5 times the maximum value.
- Get the descriptive statistics again. By how much did the mean, median, and standard
deviations move, and why?

In [52]:
# Descriptive Statistics

np.random.seed(0)

norm_series = pd.Series(np.random.normal(0, 100, 100000))

# Let's get the descriptive statistics
norm_series.describe()

count    100000.000000
mean          0.157670
std          99.734467
min        -485.211765
25%         -66.864170
50%           0.172022
75%          67.343870
max         424.177191
dtype: float64

In [58]:
# Now, replace the min value with 5 times the max value

norm_series[norm_series == norm_series.min()] = 5 * norm_series.max() 

norm_series.describe()

count    1.000000e+05
mean     1.674372e+01
std      4.279356e+03
min     -3.900025e+02
25%     -6.685246e+01
50%      1.809485e-01
75%      6.735914e+01
max      1.325554e+06
dtype: float64

First, the mean value has gone up by a bit, which makes sense, given that we took a
small value and made it much larger. That’s why the mean, while valuable, is sensitive to even a handful of very large or very small values.

Second, we see that the standard deviation has also gone up. Once again, this makes a great deal of sense, given that we have made a single value that’s much larger than anything we had before.
True, the standard deviation didn’t change by that much, but it does reflect the fact that values in our series are now spread out by more than before.

Finally, the median barely shifted. That’s because it tends to be the most stable measurement, even when we have fluctuations at the extremes. This doesn’t mean that you should always look at the median, but it can be useful. For example, if a country is trying to determine the threshold for government-sponsored benefits, a small number of very rich people would skew the mean upward, thus depriving more people of receiving that help. The median would allow us to say that (for example) the bottom 20% of earners will receive help.
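To see the three shifts side by side, one option is to make the replacement on a copy and subtract the two sets of descriptive statistics. Working on a copy also means that re-running the code doesn't compound the change, which happens if the replacement is applied repeatedly to the same series. This is a sketch; the seed and the use of `default_rng` are assumptions, so the exact numbers will differ from those shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: any seed illustrates the same effect
original = pd.Series(rng.normal(0, 100, 100_000))

# Replace the single minimum value on a copy, leaving the original intact
modified = original.copy()
modified.loc[modified.idxmin()] = 5 * original.max()

# Put both sets of statistics next to each other and compute the change
shifts = pd.DataFrame({'before': original.describe(),
                       'after': modified.describe()})
shifts['change'] = shifts['after'] - shifts['before']

print(shifts.loc[['mean', '50%', 'std']])
```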

It’s common to assume that the index in a pandas series is unique. After all, the index in a Python string, list, or tuple is unique, as are the keys in a Python dictionary. 

But a series index can contain repeated values, which turns out to be quite useful in many ways.

In [69]:
# Create the Days Index
days = 'Sun Mon Tue Wed Thu Fri Sat'.split()

# Creating a Series of Integers 
weather_series = pd.Series(np.random.normal(20, 5, 28), index=days * 4).round().astype(dtype=np.int16)


# Now, find the mean temp on Mondays
weather_series.loc['Mon'].mean()

20.5

First, we need to create a series which contains 28 elements, but
with a repeating index. Let’s start by creating a random NumPy array of 28 elements, drawn from a normal distribution, in which the mean is 20 and the standard deviation is 5.

Secondly, if I had only seven data points in my series, then I could set the index with index=days inside of the call to Series. But because we have 28 data points, I want my list to repeat itself. I can actually create such a 28-element list by multiplying my list by 4, as in days * 4.

This means that when you retrieve s.loc[i], for a given index value, you
can’t know in advance whether you will get a single, scalar value (if the index
occurs only once) or a series (if the index occurs multiple times). This is
another case in which you need to know your data, to know what type of value
you’ll get back.
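A short sketch of this behavior, using a hypothetical three-element series in which one label repeats. Passing a list of labels to `loc` is one way to get a series back regardless of how often the label occurs.

```python
import pandas as pd

# Hypothetical series: 'a' appears twice, 'b' appears once
s = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

repeated = s.loc['a']         # label occurs twice: a series comes back
unique = s.loc['b']           # label occurs once: a scalar comes back
always_series = s.loc[['b']]  # a list of labels always yields a series

print(type(repeated).__name__, type(always_series).__name__, unique)
```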

- What was the average temperature on weekends (i.e., Saturdays and Sundays)?

- How many times will the change in temperature from the previous day be greater than 2
degrees?

- What are the two most common temperatures in our data set, and how often does each
appear?

In [71]:
# Average Temperature on weekends

sat_avg_temp = weather_series.loc['Sat'].mean()

sun_avg_temp = weather_series.loc['Sun'].mean()

print(f'The average temperatures are:\nSaturday --> {sat_avg_temp}\nSunday --> {sun_avg_temp}')

The average temperatures are:
Saturday --> 24.0
Sunday --> 25.5
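If a single weekend figure is wanted rather than one per day, a list of labels passed to `loc` selects all Saturdays and Sundays at once. The temperatures below are hypothetical, covering two weeks.

```python
import pandas as pd

days = 'Sun Mon Tue Wed Thu Fri Sat'.split()

# Hypothetical temperatures for two weeks, with a repeating index
temps = pd.Series([20, 18, 19, 21, 22, 23, 24,
                   26, 17, 18, 20, 21, 22, 22], index=days * 2)

# A list of labels retrieves every row whose index matches any of them
weekend = temps.loc[['Sat', 'Sun']]
weekend_avg = weekend.mean()

print(weekend_avg)
```

Averaging the two per-day means gives the same number only when both days have the same number of readings, as they do here.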


In [77]:
# How many times will the change in temperature from the previous 
# day be greater than 2 degrees?

weather_series[weather_series.diff() > 2].count()

12

### Reading CSV Files

The data in this exercise comes from 2015; I retrieved it from New York City’s open data site, from which you can get enormous amounts of information about taxi rides in New York City over the last few years. This file shows the number of passengers in each of 100,000 rides.
Your task in this exercise is to show what percentage of taxi rides had only 1 passenger, vs. the maximum of 6 passengers.

In [78]:
# What are the two most common temperatures in our data set, and how often does each
# appear?

weather_series.mode()

0    24
dtype: int16
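Note that `mode` returns only the most frequent value (or values, if there is a tie), without saying how often it appears. To get the two most common temperatures along with their counts, one option is `value_counts` followed by `head(2)`. This sketch uses hypothetical temperatures.

```python
import pandas as pd

# Hypothetical temperatures
temps = pd.Series([24, 20, 24, 19, 20, 24, 18, 20, 24])

# value_counts sorts by frequency, most common first
top_two = temps.value_counts().head(2)

print(top_two)
```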

In [80]:
taxi_pass_series = pd.read_csv('data/taxi-passenger-count.csv', squeeze=True, header=None)

taxi_pass_series.head()

0    1
1    1
2    1
3    1
4    1
Name: 0, dtype: int64

Let’s start with reading the data into our series. read_csv is one of the most powerful and commonly used functions in pandas, reading a CSV file (or anything resembling a CSV file) into a data structure. As I mentioned above, read_csv is more typically used to create a dataframe—but if we provide it with a file that contains only one data point per line, and pass a True value to the squeeze parameter, then we’ll get a series back. Because all of the values in this file are integers, pandas assumes that we want the series dtype to be np.int64.

I also set the header parameter to be None, indicating that the first line in the file should not be taken as a column name, but rather is data to be included in our calculations. This will result in the series having a name value of 0, which we can safely ignore.
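The `squeeze` parameter of `read_csv` was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, a sketch of the equivalent is to read the file into a one-column dataframe and then call the `squeeze` method on the result. The CSV text below is a hypothetical stand-in for the taxi file.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/taxi-passenger-count.csv: one value per line
csv_text = "1\n1\n2\n6\n1\n"

# header=None keeps the first line as data; the single column is named 0
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Squeezing along the columns axis turns the one-column dataframe into a series
passengers = df.squeeze("columns")

print(passengers)
```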

### Value Counts

value_counts is a series method that is one of my favorites. If you apply value_counts to the series s, you get back a new series whose keys are the distinct values in s, and whose values are integers indicating how often each value appeared.

In [82]:
taxi_pass_series.value_counts().loc[[1,6]]

1    7207
6     369
Name: 0, dtype: int64

Because we get a series back from value_counts, we can use all of our series tricks on it. For example, we can invoke head on it, to get the five most common elements. We can also use fancy indexing, in order to retrieve the counts for specific values.

But we’re actually interested in the percentages, not in the raw values. Fortunately,
value_counts has an optional parameter, normalize, that if set to True returns the fraction of the total for each value.

In [83]:
taxi_pass_series.value_counts(normalize=True).loc[[1,6]]

1    0.720772
6    0.036904
Name: 0, dtype: float64
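A small sketch of what `normalize=True` does, on ten hypothetical passenger counts: the fractions sum to 1, and multiplying by 100 turns them into percentages.

```python
import pandas as pd

# Hypothetical passenger counts for ten rides
rides = pd.Series([1, 1, 1, 2, 6, 1, 2, 1, 1, 6])

# Fractions of the total instead of raw counts
fractions = rides.value_counts(normalize=True)

# Retrieve only the values of interest and express them as percentages
percentages = fractions.loc[[1, 6]] * 100

print(percentages)
```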

In this exercise, we’re once again going to look at taxi data, but instead of the number of passengers, we’re going to look at the distance (in miles) that each taxi ride went. We’ll divide the rides into three categories:

- short, ≤ 2 miles
- medium, > 2 miles, but ≤ 10 miles
- long, > 10 miles

In [84]:
taxi_dist_series = pd.read_csv('data/taxi-distance.csv', squeeze=True, header=None)

taxi_dist_series.head()

0    1.63
1    0.46
2    0.87
3    2.13
4    1.40
Name: 0, dtype: float64

In [86]:
pd.cut(taxi_dist_series, bins=[taxi_dist_series.min(), 2, 20, taxi_dist_series.max()],
       labels=['short', 'medium', 'long']).value_counts()

short     5823
medium    4034
long        75
Name: 0, dtype: int64

The **pd.cut** function allows us to set numeric boundaries, and then to cut a series into parts (known as "bins") based on those boundaries. Moreover, it can assign labels to each of the bins.
Notice that pd.cut is not a series method, but rather a function in the top-level pd namespace. We’ll pass it a few arguments:

- our series, *taxi_dist_series*

- a list of four numbers representing the boundaries of our three bins, assigned to the bins parameter

- a list of three strings, the labels for our three bins, assigned to the labels parameter

Note that the bin boundaries are exclusive on the left, and inclusive on the right. In other words, by specifying that the "medium" bin is between 2 and 10, that means >2 but ≤10.
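The boundary rules are easiest to see on values that sit exactly on the edges. The sketch below uses hypothetical distances with the 2-mile and 10-mile boundaries from the task. Because the left edge of each bin is exclusive, a value equal to the lowest boundary falls into no bin unless `include_lowest=True` is passed; using `0` and `np.inf` as the outer edges is an assumption made here so that the bins don't depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical distances, including values exactly on the bin edges
distances = pd.Series([0.0, 1.5, 2.0, 2.1, 10.0, 10.5, 25.0])

bins = [0, 2, 10, np.inf]
labels = ['short', 'medium', 'long']

# include_lowest=True puts a value equal to the first boundary into the first bin
categories = pd.cut(distances, bins=bins, labels=labels, include_lowest=True)

# Without include_lowest, the value 0.0 matches no bin and becomes NaN
unlabeled = pd.cut(distances, bins=bins, labels=labels)

print(categories.value_counts())
```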