<!--Course_INFORMATION-->

<img align="left" style="padding-right:10px;" src="../../figures/Atos_Logo_RGB.jpg" width="100" height="50">
*This course contains the introduction to Python and Pandas, and is part of the traineeship Data Science from Atos Data Science University*

<!--NAVIGATION-->
< [Previous](3.Merging_Joining_and_Concatenating.ipynb) | [Contents](Index.ipynb) | [Next](5.Reading_from_a_CSV.ipynb) >

Data Structures
pandas introduces two new data structures to Python, Series and DataFrame, both of which are built on top of NumPy (which is what makes them fast).
                                                                                                                     

Some remarks on the Markdown:

LaTeX equations:
Courtesy of MathJax, you can include mathematical expressions both inline, $e^{i\pi} + 1 = 0$, and displayed:
$$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

Inline expressions can be added by surrounding the LaTeX code with $: 

$e^{i\pi} + 1 = 0$

Expressions on their own line are surrounded by $$:

$$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
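# use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 50)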
%matplotlib inline
# add some extra information in case Python runs into an exception.
%xmode Verbose

Exception reporting mode: Verbose


<p>Series</p>
A Series is a one-dimensional object similar to an array, list, or column in a table. It assigns a labeled index to each item in the Series. By default, the items receive integer index labels from 0 to N, where N is the length of the Series minus one.

In some sense, we can think of a pandas Series as an extension of a NumPy array, and indeed we can use an array (or a plain list) to initialise a Series:
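As a minimal sketch of the NumPy connection, the block below builds a Series directly from an array. The names `arr` and `s_from_array` and the values are illustrative and not part of the course material.

```python
import numpy as np
import pandas as pd

# A small NumPy array of floats (illustrative values).
arr = np.array([1.5, 2.5, 3.5])

# Passing the array to the constructor gives a Series with the
# default integer index and the same dtype as the array.
s_from_array = pd.Series(arr)

print(s_from_array.index.tolist())  # the default labels
print(s_from_array.dtype)           # inherited from the array
```

Because the array holds a single dtype, the Series keeps it, unlike the mixed list in the next cell, which falls back to `object`.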



In [4]:
# create a Series with an arbitrary list
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])
s

0                7
1       Heisenberg
2             3.14
3      -1789710578
4    Happy Eating!
dtype: object

Alternatively, you can specify an index to use when creating the Series.

In [5]:
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])
s

A                7
Z       Heisenberg
C             3.14
Y      -1789710578
E    Happy Eating!
dtype: object

In [6]:
features = {'limbs':[0,4,4,4,8],'herbivore':['No','No','Yes','Yes','No']}
animals = ['Python', 'Iberian Lynx','Giant Panda', 'Field Mouse', 'Octopus']
df = pd.DataFrame(features, index=animals)

In [7]:
df.head()

              limbs herbivore
Python            0        No
Iberian Lynx      4        No
Giant Panda       4       Yes
Field Mouse       4       Yes
Octopus           8        No


The head method lets us see the
first few rows of a dataframe.
Similarly, tail will show the last
few rows.

As we mentioned above, we can refer to the column data by
the name given to the column. For instance, we can retrieve
the data about the number of limbs of rows 2 through to 4
using the following command:

In [8]:
df['limbs'][2:5]

Giant Panda    4
Field Mouse    4
Octopus        8
Name: limbs, dtype: int64
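The same selections can be written more explicitly. The sketch below rebuilds the animals DataFrame so that it runs on its own, passes a row count to `head` and `tail`, and uses `.iloc` to state that the slice is positional. The variable names `first_two`, `last_two` and `limbs_2_to_4` are illustrative.

```python
import pandas as pd

features = {'limbs': [0, 4, 4, 4, 8],
            'herbivore': ['No', 'No', 'Yes', 'Yes', 'No']}
animals = ['Python', 'Iberian Lynx', 'Giant Panda', 'Field Mouse', 'Octopus']
df = pd.DataFrame(features, index=animals)

# head and tail accept the number of rows to return (default is 5).
first_two = df.head(2)
last_two = df.tail(2)

# .iloc makes it explicit that 2:5 refers to row positions, not labels.
limbs_2_to_4 = df['limbs'].iloc[2:5]

print(first_two)
print(last_two)
print(limbs_2_to_4)
```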

In [9]:
df.loc['Python']

limbs         0
herbivore    No
Name: Python, dtype: object

If the data is numeric, the describe
method will give us some basic descriptive statistics such as
the count, mean, standard deviation, etc:

In [10]:
df['limbs'].describe()

count    5.000000
mean     4.000000
std      2.828427
min      0.000000
25%      4.000000
50%      4.000000
75%      4.000000
max      8.000000
Name: limbs, dtype: float64

Whereas if the data is categorical it provides a count, the
number of unique entries, the top category, etc.

In [11]:
df['herbivore'].describe()

count      5
unique     2
top       No
freq       3
Name: herbivore, dtype: object
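To see where `top` and `freq` come from, one option is `value_counts`, which counts every category rather than only the most frequent one. This is a small sketch on the same column; the name `counts` is illustrative.

```python
import pandas as pd

herbivore = pd.Series(['No', 'No', 'Yes', 'Yes', 'No'],
                      index=['Python', 'Iberian Lynx', 'Giant Panda',
                             'Field Mouse', 'Octopus'],
                      name='herbivore')

# value_counts returns one row per category, most frequent first.
counts = herbivore.value_counts()
print(counts)
```

The first row of `counts` corresponds to the `top` and `freq` entries that `describe` reports.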

### Slicing

Indexing gives access to a single element. Slicing, on the other hand, accesses a sequence of elements inside the list; in other words, it "slices" the list.

A slice is written as parentlist[ a : b ], where a and b are index values from the parent list: a is the index of the first element in the slice, and b is the index at which the slice stops, so the element at b itself is not included. If a is omitted the slice starts at the first element, and if b is omitted it runs through the last element.

In [12]:
s[0:0]

Series([], dtype: object)

The difference between a List and a Series


In [13]:
num = [0,1,2,3,4,5,6,7,8,9]
print(num[0:4])
print(num[4:])

[0, 1, 2, 3]
[4, 5, 6, 7, 8, 9]



You can also slice a parent list with a step: parentlist[ a : b : c ] takes every c-th element, starting at a and stopping before b.

In [15]:
print(num[:9:3]) # For the list


[0, 3, 6]


In [16]:
print(s[:6:3]) # for the Series

A              7
Y    -1789710578
dtype: object
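One difference between a list and a Series that the slices above do not show is that a Series can also be sliced by label. The sketch below uses a small illustrative Series (`letters` is not part of the course data) to contrast the two: positional slices exclude the end, label slices include it.

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Positional slice: works like a list, the end position is excluded.
by_position = letters.iloc[0:2]

# Label slice: the end label is included.
by_label = letters.loc['a':'c']

print(by_position.tolist())
print(by_label.tolist())
```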


Lists can be concatenated by adding them with '+'. The resulting list contains all the elements of the lists that were added, and it is not a nested list.

In [17]:
[1,2,3] + [5,4,7]

[1, 2, 3, 5, 4, 7]

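Let's see how that works for a Series. Note that '+' on two Series does not concatenate them: it adds the values that share an index label, as the examples further below show.

The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index. (Note that the names of the cities are the index.)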


In [18]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities

Chicago          1000.0
New York         1300.0
Portland          900.0
San Francisco    1100.0
Austin            450.0
Boston              NaN
dtype: float64
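When a dictionary is combined with an explicit `index`, the constructor keeps only the listed keys, in the listed order, and fills labels that are missing from the dictionary with NaN. A short sketch, with a smaller dictionary and an illustrative `subset` name:

```python
import pandas as pd

d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900}

# 'Seattle' is not a key of d, so its value becomes NaN.
subset = pd.Series(d, index=['Portland', 'Chicago', 'Seattle'])
print(subset)
```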

In [None]:
cities['Chicago'] 

In [None]:
cities[['Chicago', 'Portland', 'San Francisco']]

In [19]:
cities[cities < 1000] # or you can use a boolean condition for selection

Portland    900.0
Austin      450.0
dtype: float64

That last one might be a little weird, so let's make it clearer: cities < 1000 returns a Series of True/False values, which we then pass to our Series cities, returning only the items where the value is True. (Boston is NaN, and a comparison with NaN is always False.)

In [20]:
less_than_1000 = cities < 1000
print(less_than_1000)
print('\n')
print(cities[less_than_1000])

Chicago          False
New York         False
Portland          True
San Francisco    False
Austin            True
Boston           False
dtype: bool


Portland    900.0
Austin      450.0
dtype: float64
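Boolean Series can also be combined before they are used for selection. The sketch below rebuilds the original city values so that it runs on its own, and assumes `np.nan` for the missing Boston value; `mid_range` and `missing` are illustrative names.

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Chicago': 1000, 'New York': 1300, 'Portland': 900,
                    'San Francisco': 1100, 'Austin': 450, 'Boston': np.nan})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mid_range = cities[(cities >= 900) & (cities < 1200)]

# NaN never satisfies a comparison, so missing values are selected with isna().
missing = cities[cities.isna()]

print(mid_range)
print(missing)
```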


You can also change the values in a Series on the fly.

In [21]:
# changing based on the index
print('Old value:', cities['Chicago'])
cities['Chicago'] = 1400
print('New value:', cities['Chicago'])

Old value: 1000.0
New value: 1400.0


In [22]:
# changing values using boolean logic
print(cities[cities < 1000])
print('\n')
cities[cities < 1000] = 750

print(cities[cities < 1000])

Portland    900.0
Austin      450.0
dtype: float64


Portland    750.0
Austin      750.0
dtype: float64


What if you aren't sure whether an item is in the Series? You can check using idiomatic Python; the in operator tests the index labels, not the values.

In [23]:
print('Seattle' in cities)
print('San Francisco' in cities)

False
True
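Since `in` looks at the index, checking for a value needs a different expression. A minimal sketch with a two-city Series (the variable names are illustrative):

```python
import pandas as pd

cities = pd.Series({'Chicago': 1000.0, 'New York': 1300.0})

label_found = 'Chicago' in cities             # tests the index labels
value_as_label = 1300.0 in cities             # 1300.0 is a value, not a label
value_found = bool((cities == 1300.0).any())  # tests the values

print(label_found, value_as_label, value_found)
```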


Mathematical operations can be done using scalars and functions.

In [24]:
# divide city values by 3
cities / 3

Chicago          466.666667
New York         433.333333
Portland         250.000000
San Francisco    366.666667
Austin           250.000000
Boston                  NaN
dtype: float64

In [25]:
# square city values
np.square(cities)

Chicago          1960000.0
New York         1690000.0
Portland          562500.0
San Francisco    1210000.0
Austin            562500.0
Boston                 NaN
dtype: float64

You can add two Series together, which returns a Series indexed by the union of the two indexes, with the addition occurring on the shared index labels. Labels that appear in only one of the two Series produce NaN (not a number).

In [26]:
print(cities[['Chicago', 'New York', 'Portland']])
print('\n')
print(cities[['Austin', 'New York']])
print('\n')
print(cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']])

Chicago     1400.0
New York    1300.0
Portland     750.0
dtype: float64


Austin       750.0
New York    1300.0
dtype: float64


Austin         NaN
Chicago        NaN
New York    2600.0
Portland       NaN
dtype: float64


In [27]:
cities['Chicago'] + cities['New York']

2700.0

In [28]:
cities[['Chicago']]   + cities[['New York']]

Chicago    NaN
New York   NaN
dtype: float64
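If NaN for the non-shared labels is not what you want, the `add` method accepts a `fill_value` that stands in for the missing side. The sketch below uses two small illustrative Series, `a` and `b`, with the same values as the selections above.

```python
import pandas as pd

a = pd.Series({'Chicago': 1400.0, 'New York': 1300.0, 'Portland': 750.0})
b = pd.Series({'Austin': 750.0, 'New York': 1300.0})

# Plain addition: only the shared label 'New York' gets a number.
plain = a + b

# With fill_value=0, a label missing on one side is treated as 0.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```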

Adding Data, the theoretical way

In [29]:
b = pd.Series([1], index=['Alabama'])

In [30]:
b

Alabama    1
dtype: int64

In [31]:
cities.add(b)  # investigate the difference between .add and appending with pd.concat (Series.append was removed in pandas 2.0)

Alabama         NaN
Austin          NaN
Boston          NaN
Chicago         NaN
New York        NaN
Portland        NaN
San Francisco   NaN
dtype: float64
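The result above is all NaN because `add` aligns on labels and the two Series share none. To append instead, `pd.concat` places the Series one after the other. A sketch with a shortened `cities` Series; `combined` is an illustrative name.

```python
import pandas as pd

cities = pd.Series({'Chicago': 1400.0, 'New York': 1300.0})
b = pd.Series([1], index=['Alabama'])

# concat keeps every label and value instead of aligning and summing.
combined = pd.concat([cities, b])
print(combined)
```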

Adding data, the PRACTICAL WAY

In [32]:
cities['Portland e'] = 1

In [33]:
cities

Chicago          1400.0
New York         1300.0
Portland          750.0
San Francisco    1100.0
Austin            750.0
Boston              NaN
Portland e          1.0
dtype: float64

Adding a new entry (here labelled 'New Column'), the practical way, but make it empty, yet traceable:

In [34]:
cities['New Column'] = np.nan  # np.NaN was removed in NumPy 2.0

Removing data, the practical way

In [None]:
cities.drop(['New York']) # let's drop New York

In [None]:
cities  # Let's check if it is really removed!  Ah... it's still there, because drop returns a new Series

In [None]:
cities = cities.drop(['New York'])  # The easiest trick to remove New York

In [None]:
cities # Let's have a look
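The behaviour of `drop` can be checked in a few lines. This sketch uses a three-city Series so that it runs on its own; `dropped` and `unchanged` are illustrative names.

```python
import pandas as pd

cities = pd.Series({'Chicago': 1400.0, 'New York': 1300.0, 'Portland': 750.0})

# drop returns a new Series and leaves the original untouched.
dropped = cities.drop(['New York'])
print('New York' in cities)
print('New York' in dropped)

# By default a missing label raises a KeyError; errors='ignore' skips it.
unchanged = cities.drop(['Seattle'], errors='ignore')
print(len(unchanged))
```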

In [None]:
# Let's play with some data:
cities_below_1000 = cities[cities < 1000]

In [None]:
cities.isin(['Boston'])


Or, to recap the herbivores:

In [None]:
df['class']=['reptile','mammal','mammal','mammal','mollusc']
grouped = df['class'].groupby(df['herbivore'])
grouped.groups
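A grouping usually becomes useful once it is aggregated. As a sketch, the block below rebuilds the animals DataFrame and summarises the number of limbs per herbivore group; the choice of `count` and `max` is only an illustration, and `summary` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {'limbs': [0, 4, 4, 4, 8],
     'herbivore': ['No', 'No', 'Yes', 'Yes', 'No']},
    index=['Python', 'Iberian Lynx', 'Giant Panda', 'Field Mouse', 'Octopus'])

# One row per herbivore group, one column per aggregation.
summary = df.groupby('herbivore')['limbs'].agg(['count', 'max'])
print(summary)
```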


In [None]:
df

In [None]:
# exercise: fix this function to return the contents of a cell based on element and index
def myFunction(index, element):  
        #sdfsdf,
        #sdsdf,
        df['limbs'].loc(index)

In [None]:
myFunction('class', 'Python')  # executing this cell should return 'reptile'

In [None]:
[index for element, index  in enumerate(df) if]  # exercise: fix this list comprehension to show all column names of df
