In [None]:
import numpy as np
import pandas as pd

# Series

A pandas `Series` is a one-dimensional labelled array (like a "list with labels"). It can hold any data type (integers, strings, floating point numbers, Python objects, etc.) and can even mix them, in which case the values are stored with the generic `object` dtype. The axis labels are collectively referred to as the *index*.
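
As a rough illustration of how the data type depends on the contents, the sketch below builds one all-numeric and one mixed `Series`. The variable names `numbers` and `mixed` are made up for this example.

```python
import pandas as pd

# Only numbers: pandas picks a numeric dtype (the int is upcast to float)
numbers = pd.Series([180, 80.5], index=['height', 'weight'])
print(numbers.dtype)

# Numbers and a string together: pandas falls back to the generic object dtype
mixed = pd.Series([180, 80.5, 'male'], index=['height', 'weight', 'gender'])
print(mixed.dtype)
```

Numeric dtypes are generally faster and support more operations, so it is usually worth keeping mixed data out of a single `Series` where possible.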

In [None]:
person = pd.Series([180, 80.5, 'male'], index=['height', 'weight', 'gender'])
person

A pandas `Series` behaves much like a NumPy array and supports many of the same operations, such as indexing by position, slicing and boolean masking.
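
The parallels with NumPy can be sketched as follows, using a small made-up `Series` called `s`. The example uses `.iloc` for positional access, since that is the explicit way to say "by position" when the index holds labels rather than integers.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

first = s.iloc[0]         # single element by position
subset = s.iloc[[0, 2]]   # several positions at once
head = s.iloc[:2]         # slice, end point excluded as in NumPy
large = s[s > 15]         # boolean mask, as with a NumPy array
arr = s.to_numpy()        # the underlying values as a NumPy array

print(first, list(subset.index), list(head.index), list(large.index), arr.shape)
```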

In [None]:
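# .iloc selects by position (plain person[0] is deprecated for a labelled Series in recent pandas)
person.iloc[0]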

In [None]:
person[:2]

In [None]:
person.iloc[[0, 2]]

In [None]:
person_stats = person.iloc[[0, 1]]
print(person_stats)
person_stats[person_stats > 100]

A `Series` is also a bit like a dict, in that you can use its labels as indices...
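
A few more dict-like behaviours are sketched below on a small stand-alone `Series` (named `measurements` purely for this example). Note that membership tests look at the labels, not the values.

```python
import pandas as pd

measurements = pd.Series({'height': 180, 'weight': 80.5})

has_height = 'height' in measurements   # checks the labels
has_180 = 180 in measurements           # values are not searched
labels = list(measurements.keys())      # keys() returns the index

# items() yields (label, value) pairs, just like a dict
pairs = [(label, value) for label, value in measurements.items()]

print(has_height, has_180, labels, pairs)
```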

In [None]:
person['height'] = 170
person

In [None]:
person[['height', 'weight']]

In [None]:
print(person.get('height', np.nan))
# nan = "Not a Number"
print(person.get('age', np.nan))
# Uncomment to see what happens when the label is missing (a KeyError is raised)
#person['age']

Basic NumPy operations on `Series` work as you'd expect...
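
As a small sketch of this, NumPy functions can be applied directly to a `Series`, and element-wise functions return a new `Series` that keeps the labels. The `stats` variable here is a made-up, all-numeric example.

```python
import numpy as np
import pandas as pd

stats = pd.Series([180.0, 80.5], index=['height', 'weight'])

roots = np.sqrt(stats)    # element-wise; the result is still a labelled Series
total = stats.sum()       # aggregations reduce to a single number
average = np.mean(stats)  # NumPy reductions also accept a Series

print(roots)
print(total, average)
```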

In [None]:
person_stats * 2

In [None]:
person_stats + person_stats

You can also initialise a pandas Series using a dictionary.

In [None]:
person_stats2 = pd.Series({'height': 140, 'weight': 40})
person_stats2

In [None]:
person_stats2.index

In [None]:
person_stats2.values

In [None]:
person_stats3 = person_stats2[::-1]
person_stats3

In [None]:
person_stats

Even though `person_stats3` is now "backwards", when we add `person_stats` to `person_stats3` the labels `height` and `weight` are automatically aligned, so each value is added to its counterpart with the same label. This is a major difference from NumPy arrays, where elements are paired up purely by position.
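
Label alignment also determines what happens when the two indexes do not match exactly. The sketch below uses two made-up `Series`, `a` and `b`, where `b` has an extra `age` label: a label missing from either side produces `NaN`, unless a fill value is supplied via the `add` method.

```python
import pandas as pd

a = pd.Series({'height': 180, 'weight': 80})
b = pd.Series({'weight': 40, 'height': 140, 'age': 30})

# Values are matched by label; 'age' has no partner in a, so it becomes NaN
total = a + b
print(total)

# add() with fill_value treats a missing label as 0 instead
filled = a.add(b, fill_value=0)
print(filled)
```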

In [None]:
person_stats + person_stats3

## Exercises

1. Create a pandas Series to represent a single car. Include its mileage, its year of manufacture and its model.

2. Print the first element in the car Series.

3. Print the first and last element in the car Series (with a single line of code).

4. Print the car's year of manufacture using a label as the index.

5. Print the car's mileage and its year of manufacture (with a single line of code).

6. Change the car's mileage to a different value.

7. Double the car's mileage.

8. Create another Series by extracting the car's year of manufacture and its mileage.

9. Create a Series to represent a second car, this time using a dictionary for the initialisation.

10. Add together the year of manufacture and mileage of your two cars.

# Data Frames

`DataFrame`s are the main structures used to store data sets, with columns corresponding to variables and rows corresponding to observations.

They are essentially matrices containing heterogeneous data and with labelled rows and columns. You can think of a data frame as being a bit like a spreadsheet.
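
One way to picture the "heterogeneous matrix" idea is that each column is its own `Series` with its own data type. The following is a minimal sketch with a made-up two-row frame called `df`.

```python
import pandas as pd

df = pd.DataFrame({'age': [19, 72],
                   'height': [165.5, 154.0],
                   'gender': ['female', 'female']})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # one dtype per column

age_column = df['age']   # a single column is a Series
print(type(age_column))
```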

In [None]:
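# The constructor takes three arguments here: the data (a list of rows),
# the row labels (index) and the column labels (columns)
my_df = pd.DataFrame([[19, 165, 'female'], [19, 177, 'male'], [72, 154, 'female']],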
                     index=['Observation 0', 'Observation 1', 'Observation 2'],
                     columns=['age', 'height', 'gender'])
my_df

In [None]:
# Let's do some data frame introspection...
my_df.index

In [None]:
my_df.columns

In [None]:
len(my_df)

In [None]:
my_df.describe()

There are a huge number of different ways of indexing into a data frame!
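
To give a flavour of this variety, the sketch below picks out the same single value in four ways. The small frame `df` is made up for the example, and `.at` is an additional accessor intended for single values.

```python
import pandas as pd

df = pd.DataFrame({'age': [19, 34], 'height': [165, 177]},
                  index=['Observation 0', 'Observation 1'])

by_label = df.loc['Observation 1', 'height']     # row label, column label
by_position = df.iloc[1, 1]                      # row position, column position
via_column = df['height']['Observation 1']       # column first, then row label
single_value = df.at['Observation 1', 'height']  # label-based, scalars only

print(by_label, by_position, via_column, single_value)
```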

In [None]:
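# You can access a specific column's values using the dot . notation
# (this only works if the column name is a valid Python identifier and
# doesn't clash with an existing DataFrame attribute or method)
my_df.age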

In [None]:
# Or use the [] syntax to access a column's values
my_df['age']

In [None]:
# Using [] also allows you to access multiple columns at once. Notice that we need to pass
# in a list of columns.
my_df[['age', 'height']]

In [None]:
# With the .loc indexer (note the square brackets), we can select rows *and* columns based on their labels
my_df.loc['Observation 0']

In [None]:
# Here, we select the rows labelled 'Observation 1' and 'Observation 2' and the
# columns labelled 'age' and 'gender'
my_df.loc[['Observation 1', 'Observation 2'], ['age', 'gender']]

In [None]:
# The .iloc indexer allows us to select elements by positional index instead of labels
my_df.iloc[0]

In [None]:
my_df.iloc[0:2]

In [None]:
# Here, we select rows 0:2 (i.e. rows 0 and 1, the end point is excluded) and columns 0 and 2
my_df.iloc[0:2, [0, 2]]

In [None]:
# Just as for a Series, we can create a DataFrame using a dict

my_df = pd.DataFrame({'age': np.array([19, 34, 72, 21, 14, 55]),
                      'height': np.array([165, 177, 154, 161, 133, 188]),
                      'gender': np.array(['female', 'male', 'female', 'female', 'female', 'male'])},
                     index=['Observation {0}'.format(i) for i in range(6)])
my_df

In [None]:
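# Let's set the column order explicitly by selecting the columns as a list
# (recent pandas versions already keep the dict's insertion order)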
my_df = my_df[['age', 'height', 'gender']]
my_df

In [None]:
# If you have a large data set, use head() or tail() to view only the first or last few
# observations
my_df.head(2)

In [None]:
my_df.tail(2)

In [None]:
# Let's look at the oldest people first. This isn't an *in-place* sort.
my_df.sort_values('age', ascending=False)

In [None]:
# Now it's an in-place sort!
my_df.sort_values('age', ascending=False, inplace=True)
my_df

In [None]:
# Let's get those observations back in order. To do this, we can use sort_index()
# to sort by the observation name.
my_df.sort_index(ascending=True, inplace=True)
my_df

In [None]:
# Here's how to take a random sample...
my_df.sample(3)

In [None]:
# Let's extract everyone older than 50
my_df[my_df.age > 50]

In [None]:
# Let's extract all males younger than 50
my_df[(my_df.age < 50) & (my_df.gender == 'male')]

In [None]:
my_df

In [None]:
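# Let's convert 'gender' to a numerical value
def gender_to_numeric(gender):
    # 'male' maps to 0; any other value maps to 1
    if gender == 'male':
        return 0
    else:
        return 1

# apply() calls the function once for each value in the column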
my_df['gender_num'] = my_df['gender'].apply(gender_to_numeric)
my_df

In [None]:
# Let's get rid of the gender_num column
# Axis 0 refers to rows
# Axis 1 refers to columns
my_df.drop('gender_num', axis=1, inplace=True)
my_df

## Exercises

1. Create a `car_df` DataFrame containing 5 different cars. Include each car's year of manufacture, its mileage and whether it is an automatic.

2. Print the values in the mileage column.

3. Print the values in the mileage and year_of_manufacture columns (using a single line of code).

4. Print the first observation using a positional index.

5. Print the first observation using a labelled index.

6. Print the first three observations using a positional index.

7. Print the mileage and is_automatic values for the first and third observations (using a single line of code). Do this using both positional and labelled indices.

8. Print summary statistics for the data frame.

9. Print the head and tail of the data frame.

10. Print a random sample of 3 observations from the data frame.

11. Print all cars with a mileage above 30,000 miles.

12. Print all automatic cars with a mileage above 30,000 miles.

13. Print the cars in descending order of mileage (i.e. with the highest-mileage cars first).

14. Convert is_automatic to a numerical value and put these values in a new column.

15. Delete the new column you created.