# Series Object

DataFrame objects are like a dictionary full of series objects. 

Series objects are also very similar to the list object in Python. As such, the easiest way to create one is to pass a list into the Series constructor. However, a series differs from a Python list in that all of its values are stored under a single data type (dtype). If you create a series from mixed data types, pandas does not pick the most common type; it picks one type that can hold every value, which for a mix of strings and numbers is the general-purpose object dtype.

In [1]:
import pandas

fruit = pandas.Series(['Apple', 'Orange', 'Pear', 'Banana'])

print(fruit)

0     Apple
1    Orange
2      Pear
3    Banana
dtype: object


As you can see, when you print a series object, it gives you a numbered list of all of the values in the series, followed by the dtype of the series.
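As a quick check of how a series settles on one dtype, the sketch below builds three small series from made-up values (the variable names are only for illustration) and prints the dtype pandas reports for each. The dtype is the one type able to hold every value, not the most common one.

```python
import pandas

# All integers: pandas stores them with an integer dtype
whole_numbers = pandas.Series([1, 2, 3])

# Integers mixed with a float: every value is upcast to float64
mixed_numbers = pandas.Series([1, 2, 3.5])

# Numbers mixed with a string: pandas falls back to the generic object dtype
mixed_types = pandas.Series([1, 'two', 3.0])

print(whole_numbers.dtype)
print(mixed_numbers.dtype)
print(mixed_types.dtype)
```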

Similarly to the dataframe object, we can change the index labels of a series object to reflect what the values in the series represent.

For example:

In [3]:

people = pandas.Series(['Steve Jobs', 'Co-Founder of Apple'], index=['Person', 'Description'])

print(people)

Person                  Steve Jobs
Description    Co-Founder of Apple
dtype: object


By setting the index option when creating the series object, you can specify a list to be used as the names of the rows in the series.
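Once a series has named rows, those names can be used to look values up. The following is a minimal sketch that rebuilds the same series and reads values back by label and by position.

```python
import pandas

people = pandas.Series(
    ['Steve Jobs', 'Co-Founder of Apple'],
    index=['Person', 'Description'],
)

# Look up a value by its index label, much like a dictionary key
print(people['Person'])

# .loc does the same label lookup explicitly; .iloc uses the position instead
print(people.loc['Description'])
print(people.iloc[0])
```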


## Series Methods

In the last chapter, I talked about some of the methods that could be applied to a series object such as the sum() and mean() methods. But there are many more methods that can be performed on a series object such as:

* append( ) - concatenates two or more series (removed in pandas 2.0, where pandas.concat( ) does the same job)
* describe( ) - calculates a summary of statistics
* hist( ) - creates a histogram of the data
* min( ) and max( ) - return the minimum and maximum values of the series
* sort_values( ) - sorts the values in the series
* and many more
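Before turning to real data, here is a small sketch of a few of these methods on made-up numbers (the series names and values are hypothetical). It joins the series with pandas.concat, because Series.append is not available in pandas 2.0 and later.

```python
import pandas

# Made-up readings, used only to illustrate the methods listed above
morning = pandas.Series([2.5, 0.4, 3.1])
evening = pandas.Series([1.8, 4.2])

# pandas.concat joins the series; ignore_index=True renumbers the rows from 0
readings = pandas.concat([morning, evening], ignore_index=True)

print(readings.min(), readings.max())
print(readings.sort_values(ascending=False))  # largest values first
```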

Let's use the sample data from the last chapter to explore these methods.


In [9]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')
group_value = dataframe['Value (kWh)']
group_description = group_value.describe()
print(group_description)

count    432.000000
mean       2.322785
std        0.919252
min        0.145000
25%        1.584000
50%        2.425000
75%        2.861000
max        6.331000
Name: Value (kWh), dtype: float64


As I stated above, the describe method calculates a summary of statistics including the count, mean, standard deviation, minimum, maximum, and the values at the 25th, 50th, and 75th percentiles.
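The result of describe( ) is itself a series, so each statistic can be pulled out by its label. The sketch below uses a handful of made-up values rather than the usage file, so the numbers are purely illustrative.

```python
import pandas

# Made-up values standing in for a column of readings
values = pandas.Series([1.0, 2.0, 3.0, 4.0, 5.0])

summary = values.describe()

# Each statistic in the summary can be read by its label
print(summary['mean'])
print(summary['50%'])

# The 50% row is the median, the same number quantile(0.5) returns
print(values.quantile(0.5))
```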


In [8]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')
group_value = dataframe['Value (kWh)']
group_sorted = group_value.sort_values()
print(group_sorted.head(15))

103    0.145
121    0.202
222    0.205
175    0.396
127    0.448
155    0.499
200    0.541
125    0.605
131    0.653
132    0.658
225    0.718
101    0.730
119    0.772
220    0.776
153    0.783
Name: Value (kWh), dtype: float64


You can also subset a series by comparing it against the results of the methods above.

Here are a couple of examples:


In [3]:
import pandas

dataframe = pandas.read_csv('usage_2018-08-22.csv')
group_value = dataframe['Value (kWh)']
group_mean = group_value.mean()
print("Mean Value = " + str(group_mean) + "\n")
print(group_value[group_value > group_value.mean()].head(15)) # Prints the first 15 values that are greater than the mean


group_quantile = group_value.quantile(.90)
print("\nValue at 90% = " + str(group_quantile) + "\n")
print(group_value[group_value > group_value.quantile(.90)].head(15)) # Prints the first 15 values above the 90th percentile (the top 10%)

Mean Value = 2.3227847222222224

0     2.343
12    2.441
13    2.500
14    2.546
15    2.584
16    2.570
17    2.895
18    2.547
19    2.512
20    2.489
21    2.471
22    2.459
35    2.411
36    2.432
37    2.858
Name: Value (kWh), dtype: float64

Value at 90% = 3.515200000000002

43     4.261
45     3.552
58     3.641
69     3.585
140    3.849
141    3.859
165    3.946
168    4.617
178    3.879
185    3.637
186    4.109
229    3.760
260    4.360
261    3.594
274    4.458
Name: Value (kWh), dtype: float64
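It may help to see what the comparison inside the square brackets produces on its own. This sketch uses made-up values instead of the usage file: the comparison returns a series of True/False values, and indexing with it keeps only the rows marked True.

```python
import pandas

# Made-up values; the example above uses the 'Value (kWh)' column instead
values = pandas.Series([1.0, 4.0, 2.0, 5.0])

# The comparison yields a boolean series with one True/False per row
above_mean = values > values.mean()
print(above_mean)

# Indexing with that mask keeps only the rows marked True,
# and the original index labels are preserved
print(values[above_mean])
```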




# Dataframe Object

As I said above, the dataframe object is like a dictionary of multiple series objects. And in the same way that we create series objects, we can also create dataframe objects. To create a dataframe, we can simply pass a dictionary into the DataFrame constructor, where each value in the dictionary becomes one series (column).

To create the dictionary, first you specify the name of the column, followed by a colon, and then the list of all of the data in that column. This will create one series object. To create more columns, simply repeat this process, separating each name:value pair with commas.

In [5]:

products = pandas.DataFrame({
    'Name' : ['Apple', 'Orange', 'Pear'],
    'Price' : [1.00, 1.50, 2.00],
    'Exp. Date' : ['11/1', '11/5', '10/30']
})

print(products)

     Name  Price Exp. Date
0   Apple    1.0      11/1
1  Orange    1.5      11/5
2    Pear    2.0     10/30


In older versions of Python and pandas, the order in which you add columns to your dataframe object is not guaranteed to be the order in which they appear when it is created (recent versions keep the dictionary's order). A lot of times when you are running different analyses on your data, you need to know which data is in which column, and if you want to automate that task, then that data always needs to be in the same position.

To ensure that the columns stay in the order you want, set the columns option to a list of the column names in the order in which you want them to appear.

For example, if I want to reorder the above dataframe but start with the Exp. Date, then the price, and lastly the name, I should do the following:


In [10]:
products = pandas.DataFrame({
    'Name' : ['Apple', 'Orange', 'Pear'],
    'Price' : [1.00, 1.50, 2.00],
    'Exp. Date' : ['11/1', '11/5', '10/30']},
    columns=['Exp. Date', 'Price', 'Name']
)

print(type(products))

print(products)

<class 'pandas.core.frame.DataFrame'>
  Exp. Date  Price    Name
0      11/1    1.0   Apple
1      11/5    1.5  Orange
2     10/30    2.0    Pear
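The columns option applies when the dataframe is created. For a dataframe that already exists, one possible approach (a sketch, not the only way) is to select the columns with a list of names in the order you want; the name reordered is just an illustrative variable.

```python
import pandas

products = pandas.DataFrame({
    'Name': ['Apple', 'Orange', 'Pear'],
    'Price': [1.00, 1.50, 2.00],
    'Exp. Date': ['11/1', '11/5', '10/30'],
})

# Selecting with a list of column names returns a new dataframe
# whose columns follow the order of that list
reordered = products[['Exp. Date', 'Price', 'Name']]

print(list(reordered.columns))
```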


Just as you saw in series objects, you can select certain rows of a dataframe object using boolean statements, such as selecting all of the rows where the value of the price of the fruit is greater than the average price. 


In [8]:
products = pandas.DataFrame({
    'Name' : ['Apple', 'Orange', 'Pear', 'Banana', 'Grape', 'Kiwi'],
    'Price' : [1.00, 1.50, 2.00, .50, 1.75, 3.00],
    'Exp. Date' : ['11/1', '11/5', '10/30', '11/17', '11/10', '11/9']},
    columns=['Exp. Date', 'Price', 'Name']
)

print(str(type(products))+"\n")

print("The average price is: "+str(products['Price'].mean())+"\n")

print(products[products['Price'] > products['Price'].mean()])

<class 'pandas.core.frame.DataFrame'>

The average price is: 1.625

  Exp. Date  Price   Name
2     10/30   2.00   Pear
4     11/10   1.75  Grape
5      11/9   3.00   Kiwi


This is just one of the ways that you can subset data from a dataframe object. 
Other ways include:

* df[column_name] - Selects a single column
* df[[column_one, column_two, ...]] - Selects multiple columns
* df.loc[row_label] - Selects a single row by its index label
* df.loc[[row_one, row_two, ...]] - Selects multiple rows by their index labels
* df.iloc[row_position] - Similar to loc, but selects by integer position rather than by index label
* df[bool_series] - Selects the rows where a boolean series is True
* df[[bool_one, bool_two, ...]] - Selects rows using a list of boolean values, one per row
* df[start:stop:step] - Selects rows using the slicing syntax of other Python objects
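The selection styles in this list can be tried side by side. The sketch below uses a small products dataframe with hypothetical letter labels for the rows, chosen so that selecting by label (loc) and by position (iloc) are easy to tell apart.

```python
import pandas

products = pandas.DataFrame(
    {
        'Name': ['Apple', 'Orange', 'Pear', 'Banana'],
        'Price': [1.00, 1.50, 2.00, 0.50],
    },
    index=['a', 'b', 'c', 'd'],  # hypothetical row labels
)

print(products['Price'])                   # one column, returned as a series
print(products[['Name', 'Price']])         # several columns, as a dataframe
print(products.loc['b'])                   # one row, chosen by label
print(products.loc[['a', 'c']])            # several rows, chosen by label
print(products.iloc[0])                    # one row, chosen by position
print(products[products['Price'] > 1.00])  # rows where the condition is True
print(products[0:4:2])                     # every second row, by position
```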

You can also combine the subsetting methods for both row and columns to select single pieces of information from your dataframe object. 


In [20]:
products = pandas.DataFrame({
    'Name' : ['Apple', 'Orange', 'Pear', 'Banana', 'Grape', 'Kiwi'],
    'Price' : [1.00, 1.50, 2.00, .50, 1.75, 3.00],
    'Exp. Date' : ['11/1', '11/5', '10/30', '11/17', '11/10', '11/9']},
    columns=['Exp. Date', 'Price', 'Name']
)

print(str(type(products))+"\n")

print(products)

print(products['Price'].loc[1]) # Will print out the price of the second row in the dataframe

<class 'pandas.core.frame.DataFrame'>

  Exp. Date  Price    Name
0      11/1   1.00   Apple
1      11/5   1.50  Orange
2     10/30   2.00    Pear
3     11/17   0.50  Banana
4     11/10   1.75   Grape
5      11/9   3.00    Kiwi
1.5
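Selecting the column first and then the row works, but loc and iloc also accept a row and a column together. This is a minimal sketch of that form on a shortened copy of the products data.

```python
import pandas

products = pandas.DataFrame({
    'Name': ['Apple', 'Orange', 'Pear'],
    'Price': [1.00, 1.50, 2.00],
})

# Row label first, then column name, in a single .loc call
print(products.loc[1, 'Price'])

# The same cell by position: second row, second column
print(products.iloc[1, 1])

# A block of cells: rows 0 through 1 (the end is included with .loc) and two columns
print(products.loc[0:1, ['Name', 'Price']])
```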


# Changing our Data

The Pandas library also allows us to make changes to our data. For example, maybe when we were creating our own data, we forgot to add another field to the dataframe.

Currently our products dataframe looks like this:

  Exp. Date  Price    Name
0      11/1   1.00   Apple
1      11/5   1.50  Orange
2     10/30   2.00    Pear
3     11/17   0.50  Banana
4     11/10   1.75   Grape
5      11/9   3.00    Kiwi

But how would we add another column that displays the aisle number in the store where you can find that fruit?
To add a column, you simply create a list of the new values you want to add, and then set your new column in the dataframe to that list.


In [21]:
isle_numbers = [4, 10, 11, 4, 6, 8] # The new list of aisle numbers to be added

products['Isle'] = isle_numbers # Even though products['Isle'] doesn't exist yet, assigning the list to it adds that new column to the dataframe

print(products)

  Exp. Date  Price    Name  Isle
0      11/1   1.00   Apple     4
1      11/5   1.50  Orange    10
2     10/30   2.00    Pear    11
3     11/17   0.50  Banana     4
4     11/10   1.75   Grape     6
5      11/9   3.00    Kiwi     8
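Two related points are worth a quick sketch: a new column can be computed from an existing one, and a list with the wrong number of values is rejected. The 'Sale Price' column and the half-price figure are hypothetical and used only for illustration.

```python
import pandas

products = pandas.DataFrame({
    'Name': ['Apple', 'Orange', 'Pear'],
    'Price': [1.00, 1.50, 2.00],
})

# A new column can also be computed from an existing one
# ('Sale Price' is a hypothetical column used only for illustration)
products['Sale Price'] = products['Price'] * 0.5

# A list with the wrong number of values is rejected with a ValueError
try:
    products['Isle'] = [4, 10]  # two values for three rows
except ValueError as error:
    print("Could not add column:", error)

print(products)
```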


## Deleting Data

What if there is a column in your data that is completely useless and you want to delete it? Well, pandas also provides you with a way to do that.

Dataframes in pandas come with the method drop( ), which takes the name of the column (or row) you wish to drop and returns a new dataframe without it. The drop method can drop both columns and rows, which is why you must also specify the axis argument. Set axis to 1 to drop columns, and set axis to 0 to drop rows.

Note that, by default, the drop method does not affect the dataframe it is used on. It simply creates a new dataframe without the specified column or row.


In [22]:
# Let's delete the Isle column that we just created as well as the Kiwi row

products_drop_col = products.drop(['Isle'], axis=1) # Drops the Isle column

print(products_drop_col)
print("\n")

products_drop_row = products_drop_col.drop([5], axis=0) # Drops the row with index label 5 (the sixth row)

print(products_drop_row)



  Exp. Date  Price    Name
0      11/1   1.00   Apple
1      11/5   1.50  Orange
2     10/30   2.00    Pear
3     11/17   0.50  Banana
4     11/10   1.75   Grape
5      11/9   3.00    Kiwi


  Exp. Date  Price    Name
0      11/1   1.00   Apple
1      11/5   1.50  Orange
2     10/30   2.00    Pear
3     11/17   0.50  Banana
4     11/10   1.75   Grape
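As an alternative to the axis argument, drop( ) also accepts the columns and index keywords, which some readers find easier to remember. The sketch below uses a shortened copy of the products data and also confirms that the original dataframe is left untouched.

```python
import pandas

products = pandas.DataFrame({
    'Name': ['Apple', 'Orange', 'Pear'],
    'Price': [1.00, 1.50, 2.00],
    'Isle': [4, 10, 11],
})

# columns= and index= say directly what is being dropped,
# so no axis argument is needed
without_isle = products.drop(columns=['Isle'])
without_pear = products.drop(index=[2])

print(list(without_isle.columns))
print(list(without_pear.index))

# The original dataframe still has every column and row
print(products.shape)
```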


# Saving our Data

In the previous chapter, we imported data to perform analyses on. But what if we made changes to our datasets and wanted to save those changes? The Pandas library also allows you to do this.

One way of doing this is to pickle the data. Pickling is Python's way of serializing data and saving it in a binary format. It is done with the to_pickle( ) method, to which you pass the path where you want to save the data.
So let's try to save our product data by pickling it.


In [23]:
products.to_pickle('products_df.pickle') # Since I want to save this in my current directory, I can simply put the name of the file as the path

In [25]:
# How do we know if the pickle worked?
# We can also read in pickle data using the read_pickle() function

products_from_pickle = pandas.read_pickle('products_df.pickle')

print(products_from_pickle)

  Exp. Date  Price    Name  Isle
0      11/1   1.00   Apple     4
1      11/5   1.50  Orange    10
2     10/30   2.00    Pear    11
3     11/17   0.50  Banana     4
4     11/10   1.75   Grape     6
5      11/9   3.00    Kiwi     8
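Rather than comparing the printed tables by eye, the round trip can be checked in code. This is a minimal sketch on a shortened copy of the data; it writes to a temporary directory (an assumption made here so the example leaves no files behind) and compares the two dataframes with equals( ).

```python
import os
import tempfile

import pandas

products = pandas.DataFrame({
    'Name': ['Apple', 'Orange', 'Pear'],
    'Price': [1.00, 1.50, 2.00],
})

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'products_df.pickle')
    products.to_pickle(path)
    restored = pandas.read_pickle(path)

# equals() is True when both dataframes hold the same values
# in columns of the same dtype
print(products.equals(restored))
```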


Pickling our data is a great way for us to save and access our data, but it's not human readable as it is stored in a binary format. So what about when we want to examine our data without using Python or Pandas? Luckily, pickling our data is not our only way of saving it.

As we saw in the first chapter, we loaded data using the read_csv( ) function, which reads a file of comma-separated values and stores it as a dataframe. Pandas also lets us save a dataframe as a csv using the to_csv( ) method, which takes an argument for the path to the file you want to save.

In [26]:
products.to_csv('products_df.csv')

In [27]:
# Again, how do we know if this worked?
# Let's read in the csv file we just created with the read_csv() function

products_from_csv = pandas.read_csv('products_df.csv')

print(products_from_csv)

   Unnamed: 0 Exp. Date  Price    Name  Isle
0           0      11/1   1.00   Apple     4
1           1      11/5   1.50  Orange    10
2           2     10/30   2.00    Pear    11
3           3     11/17   0.50  Banana     4
4           4     11/10   1.75   Grape     6
5           5      11/9   3.00    Kiwi     8
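The extra "Unnamed: 0" column appears because to_csv( ) writes the row index into the file as an unnamed first column, and read_csv( ) then treats it as ordinary data. Below is a sketch of two ways around this, using an in-memory buffer in place of a file on disk (an assumption made here to keep the example self-contained).

```python
import io

import pandas

products = pandas.DataFrame({
    'Name': ['Apple', 'Orange', 'Pear'],
    'Price': [1.00, 1.50, 2.00],
})

# Option 1: leave the index out of the file when writing
buffer = io.StringIO()  # in-memory stand-in for a file on disk
products.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pandas.read_csv(buffer)

# Option 2: write the index, then tell read_csv which column holds it
buffer = io.StringIO()
products.to_csv(buffer)
buffer.seek(0)
with_index = pandas.read_csv(buffer, index_col=0)

print(list(without_index.columns))
print(list(with_index.columns))
```

Either option gives back a dataframe with only the Name and Price columns; the second one is the better fit when the index carries meaningful labels, such as names.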


# Exercise

First, I want you to create a new dataframe object that has information about other students in the class. You should store at least 10 students and keep track of their name, age, standing (freshman, sophomore, ...), major, and grade in the class. I want the name to be used as the row name and for those attributes to be in that order.

Then, I want a list of all of the students who have an A in the class, and another list of all the students whose age is smaller than the average.

Then, say student number 4 decided to drop the class, which also made student number 8 drop as well. I want you to delete those students, save the remaining students' data into a csv file, and then read that file back in to show that it was saved without the deleted students.

In [40]:
students = pandas.DataFrame({
    'Age' : [18,18,19,23,22,18,28,20,21,24],
    'Standing' : ['Freshman', 'Freshamn', 'Sophmore', 'Senior', 'Senior', 'Freshman', 'Junior', 'Sophmore', 'Junior', 'Senior'],
    'Major' : ['CITE','COSC','POSC','MATH','CITE','PHYS','COSC','CITE','ANTH','SOCI'],
    'Grade' : [74, 88, 91, 96, 78, 82, 90, 79, 89, 85]},
    index=['Student 1','Student 2','Student 3','Student 4','Student 5','Student 6','Student 7','Student 8','Student 9','Student 10'],
    columns=['Age', 'Standing', 'Major', 'Grade']
)

print(students)

print("\nThe students who have an A in the class are: \n")

print(students[students['Grade'] > 90])

print("\nThe average age of students is: " + str(students['Age'].mean())+"\n")

print(students[students['Age'] < students['Age'].mean()])

print("\n")

students_dropped = students.drop(["Student 4", "Student 8"], axis=0)

students_dropped.to_csv('Students_df.csv')

students_from_csv = pandas.read_csv('Students_df.csv')

print(students_from_csv)


            Age  Standing Major  Grade
Student 1    18  Freshman  CITE     74
Student 2    18  Freshamn  COSC     88
Student 3    19  Sophmore  POSC     91
Student 4    23    Senior  MATH     96
Student 5    22    Senior  CITE     78
Student 6    18  Freshman  PHYS     82
Student 7    28    Junior  COSC     90
Student 8    20  Sophmore  CITE     79
Student 9    21    Junior  ANTH     89
Student 10   24    Senior  SOCI     85

The students who have an A in the class are: 

           Age  Standing Major  Grade
Student 3   19  Sophmore  POSC     91
Student 4   23    Senior  MATH     96

The average age of students is: 21.1

           Age  Standing Major  Grade
Student 1   18  Freshman  CITE     74
Student 2   18  Freshamn  COSC     88
Student 3   19  Sophmore  POSC     91
Student 6   18  Freshman  PHYS     82
Student 8   20  Sophmore  CITE     79
Student 9   21    Junior  ANTH     89


   Unnamed: 0  Age  Standing Major  Grade
0   Student 1   18  Freshman  CITE     74
1   Student 2   18