<h2> ======================================================</h2>
 <h1>MA477 - Theory and Applications of Data Science</h1> 
  <h1>Lesson 3: Manipulating Data with Pandas</h1> 
 
 <h4>Dr. Valmir Bucaj</h4>
 United States Military Academy, West Point 
AY20-2
<h2>=======================================================</h2>

Pandas is a modern package built on top of NumPy that provides an efficient implementation of a ``DataFrame``.
``DataFrame``s are essentially two-dimensional arrays with attached row and column labels that allow for different data types and/or missing data. Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

In this lesson, we will focus on the mechanics of using ``Series``, ``DataFrame``, and related structures effectively.
We will use examples drawn from real datasets where appropriate. These examples are not the main focus; they serve as a means of illustrating the use of the Pandas library.

Pandas, just like NumPy, comes preinstalled with Anaconda. If you are using a desktop IDE, you may first need to install it from the command line using:

```python
pip install pandas
```

You may also install it from the conda shell using: 
```python 
conda install pandas
```
To use Pandas you first need to import it into the Jupyter Notebook. It is customary to import it using `pd` as an alias.

In [4]:
import pandas as pd

In [5]:
import numpy as np

<h2>Lecture Outline</h2>

<ul>   
 <li> <b> Series</b></li>
 <li><b>DataFrames </b></li>
 <li><b>Missing Data</b></li>
 <li><b>GroupBy</b></li>
 <li><b>Merging, Joining, Concatenating</b></li>
 <li> <b>Operations with Pandas</b></li>
 <li><b> Importing and Exporting Data</b></li>
</ul>
<hr style="height:2px;border:none;color:#333;background-color:#333;" />
 


<h2> Series </h2>

A Pandas `Series` is built on top of a NumPy array. However, unlike NumPy arrays, a Pandas `Series` can have axis labels, that is, its entries can be indexed by a label and not only by position.

<h3> Creating Series from Python Objects (Lists, Arrays, Dictionaries)</h3>

```python
labels=['a','b','c','d']
my_data=[10,20,30,40]
my_arr=np.array(my_data)
my_dict={'x':11,'y':22,'z':33,'w':44}
```
<b>Creating a Pandas Series from a Python List:</b>

```python
pd.Series(data=my_data)
#Result
0    10
1    20
2    30
3    40
dtype: int64
 ```
 
Specifying the index labels:

```python
pd.Series(data=my_data,index=labels)
#Result
a    10
b    20
c    30
d    40
dtype: int64
```
    
Unlike with NumPy arrays, we can access the data in a Pandas `Series` using the index labels. We will demonstrate that shortly.

<b>Creating a Pandas Series from a NumPy Array:</b>

```python
pd.Series(my_arr)
#Result
0    10
1    20
2    30
3    40
dtype: int32
    
pd.Series(my_arr,labels)
#Result
a    10
b    20
c    30
d    40
dtype: int32
```
    
<b>Creating a Pandas Series from a Python Dictionary:</b>

```python
pd.Series(my_dict)
#Result
x    11
y    22
z    33
w    44
dtype: int64
 ```
 
A Series can hold any type of data object, not just `int` and `float`:

```python
my_data=['Valmir',print, 'Humphries',sum,'Ronasia',44,100,'Joshua',3.14]

pd.Series(my_data)
#Result
0                       Valmir
1    <built-in function print>
2                    Humphries
3      <built-in function sum>
4                      Ronasia
5                           44
6                          100
7                       Joshua
8                         3.14
dtype: object
```
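
As a quick check of how the element types determine the `dtype` of a Series, here is a minimal sketch. The variable names `ints`, `floats`, and `mixed` are illustrative and are not part of the lesson's data.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])          # all integers
floats = pd.Series([1, 2.5, 3])      # integers are upcast to float
mixed = pd.Series([1, 'two', 3.0])   # no common numeric type

print(ints.dtype)    # an integer dtype
print(floats.dtype)  # float64
print(mixed.dtype)   # object
```

When a Series falls back to `object`, each entry is stored as a generic Python object, so numerical operations on it are typically slower than on a numeric `dtype`.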

<h3> Accessing Data in a Pandas Series</h3>

Let's create the following two series:

```python
ser1=pd.Series(['Valmir','Humphries','Ronasia',44,100,'Joshua',3.14])
#Result
ser1
0       Valmir
1    Humphries
2      Ronasia
3           44
4          100
5       Joshua
6         3.14
dtype: object
    
ser2=pd.Series(data=[2005,2016,1976,1997],index=['F-22A','F-35A','F-15C','B-2A'])
#Result
ser2
F-22A    2005
F-35A    2016
F-15C    1976
B-2A     1997
dtype: int64
 ```
 
 If we want to access `Ronasia` from the first series and the year when the `F-35A` was introduced, we may do so as follows:
 
 ```python
ser1[2]
#Result
Ronasia

ser2['F-35A']
#Result
2016

```
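
Square brackets work for both lookups above, but it can be clearer to state whether a label or a position is meant. The sketch below rebuilds `ser2` and uses the `.loc` (label-based) and `.iloc` (position-based) indexers. The names `by_label`, `by_position`, and `subset` are chosen here for illustration.

```python
import pandas as pd

ser2 = pd.Series(data=[2005, 2016, 1976, 1997],
                 index=['F-22A', 'F-35A', 'F-15C', 'B-2A'])

by_label = ser2.loc['F-35A']           # label-based lookup
by_position = ser2.iloc[1]             # position-based lookup (second entry)
subset = ser2.loc[['F-22A', 'B-2A']]   # a list of labels returns a Series

print(by_label, by_position)  # 2016 2016
print(subset)
```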

<h2> Basic Operations with Series</h2>

Take the following two series:
```python
ser2=pd.Series(data=[2005,2016,1976,1997],index=['F-22A','F-35A','F-15C','B-2A'])
#Result
ser2
F-22A    2005
F-35A    2016
F-15C    1976
B-2A     1997
dtype: int64
    
ser3=pd.Series(data=[2005,2016,1954,1997],index=['F-22A','F-35A','HC-130P','B-2A'])
#Result
ser3 
F-22A      2005
F-35A      2016
HC-130P    1954
B-2A       1997
dtype: int64
```

Most operations on `Series` are aligned on their index. For example, if we add `ser2` and `ser3` together, Pandas matches the entries by index label and adds the matching values. If a label does not appear in both series, the result for that label is `NaN`:

```python
ser2+ser3
#Result
B-2A       3994.0
F-15C         NaN
F-22A      4010.0
F-35A      4032.0
HC-130P       NaN
dtype: float64
```
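
If the `NaN` entries are not wanted, one option is the `Series.add` method with the `fill_value` argument, which substitutes a value for a label missing from one side before adding. The following is a sketch using the same two series; treating a missing year as `0` is an assumption made only to illustrate the mechanics.

```python
import pandas as pd

ser2 = pd.Series([2005, 2016, 1976, 1997],
                 index=['F-22A', 'F-35A', 'F-15C', 'B-2A'])
ser3 = pd.Series([2005, 2016, 1954, 1997],
                 index=['F-22A', 'F-35A', 'HC-130P', 'B-2A'])

# Labels present in only one series are paired with fill_value instead of NaN
total = ser2.add(ser3, fill_value=0)
print(total)
```

Here `F-15C` and `HC-130P` keep their original values (1976 and 1954) because the missing partner is treated as `0`.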
<hr style="height:1px;border:none;color:#333;background-color:#333;" />


<h2><font color='red'>Practice Exercise</font></h2>

Create the following Pandas `series`:

```python
#Result
CA    423.0
TX    695.0
NY    141.0
FL    170.0
IL    149.0
dtype: float64
```


In [108]:
#Enter your code here


<hr style="height:1.5px;border:none;color:#333;background-color:#333;" />
<h2>DataFrames</h2>

`DataFrames` will be the data structures we will mostly use when working with Pandas.

Let's begin by importing `randn` from `numpy.random`, which draws samples from the standard normal distribution. We will also set a `seed` so that we all get the same random numbers.

In [72]:
from numpy.random import randn
np.random.seed(42)

<h3> Creating DataFrames</h3>

The basic command for creating DataFrames is:

```python
pd.DataFrame(data=None,index=None,columns=None,dtype=None)
```

Let's create a `DataFrame`:

In [80]:
pd.DataFrame(data=randn(7,4),columns=['A','B','C','D'])

Unnamed: 0,A,B,C,D
0,-0.808494,-0.501757,0.915402,0.328751
1,-0.52976,0.513267,0.097078,0.968645
2,-0.702053,-0.327662,-0.392108,-1.463515
3,0.29612,0.261055,0.005113,-0.234587
4,-1.415371,-0.420645,-0.342715,-0.802277
5,-0.161286,0.404051,1.886186,0.174578
6,0.25755,-0.074446,-1.918771,-0.026514
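
The `data` argument does not have to be a NumPy array. As a small sketch, a dictionary that maps column names to lists of equal length also works. The column names `x` and `y` and the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; the lists must have equal length
frame = pd.DataFrame({'x': [1, 2, 3], 'y': [0.5, 1.5, 2.5]},
                     index=['a', 'b', 'c'])

print(frame)
print(frame.shape)  # (3, 2)
```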


In some cases we may want to also set a customized index label:

In [267]:
df=pd.DataFrame(randn(7,4),index=['a','b','c','d','e','f','g'],columns=['A','B','C','D'])

In [268]:
df

Unnamed: 0,A,B,C,D
a,-1.04415,-0.889607,0.36396,-0.275587
b,-0.358961,-0.713398,0.613072,0.782329
c,-0.77798,-1.277393,-0.90849,0.943944
d,2.207005,-0.050901,-0.228822,-0.338943
e,-1.187508,-3.762726,0.13439,-0.422286
f,-0.721983,1.901407,1.187396,-0.09344
g,0.163542,0.241777,-1.875747,0.482121


One thing to note is that each of the columns of the `DataFrame` is actually just a `Series`.

<h3>Accessing the Columns and Index of a DataFrame</h3>

Often, especially when dealing with large DataFrames, we want to access the columns or the index of a DataFrame. We can do that as follows:

In [151]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [152]:
df.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')

<h3>Selecting Elements From a DataFrame</h3>

<h4> Selecting Columns</h4>

In [153]:
df['A']

a   -1.070892
b    0.473238
c   -0.446515
d    0.173181
e    0.058209
f    1.083051
g    0.515035
Name: A, dtype: float64

In [154]:
df['C']

a   -0.223463
b   -0.846794
c    0.214094
d   -0.883857
e    0.357787
f   -1.377669
g    0.515048
Name: C, dtype: float64

We can check the type of each column and confirm that it is just a `Series`:

In [155]:
type(df['C'])

pandas.core.series.Series

Similarly, we can check the type of the entire DataFrame:

In [156]:
type(df)

pandas.core.frame.DataFrame

Although it is not usually recommended, as it may interfere with Pandas' built-in methods, we may also grab columns in the following way:

In [157]:
df.A

a   -1.070892
b    0.473238
c   -0.446515
d    0.173181
e    0.058209
f    1.083051
g    0.515035
Name: A, dtype: float64

The problem with selecting columns this way arises when a column name coincides with an existing DataFrame attribute or method. The dot notation then refers to the built-in rather than to the column, and assigning to it may overwrite that method, which can lead to confusion and problems down the line.
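
To make the name collision concrete, here is a minimal sketch with a hypothetical DataFrame `inventory` whose column `count` shares its name with the built-in `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical frame: the column name 'count' collides with DataFrame.count()
inventory = pd.DataFrame({'part': ['bolt', 'nut'], 'count': [12, 30]})

print(callable(inventory.count))     # True: the method, not the column
print(inventory['count'].tolist())   # [12, 30]: bracket notation is unambiguous
print(inventory.part.tolist())       # ['bolt', 'nut']: no collision for 'part'
```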

<h4> Selecting Multiple Columns</h4>

We can select multiple columns from the DataFrame at once by passing in a list of the column names. In this case we will get back a DataFrame:

In [158]:
df[['A','C']]

Unnamed: 0,A,C
a,-1.070892,-0.223463
b,0.473238,-0.846794
c,-0.446515,0.214094
d,0.173181,-0.883857
e,0.058209,0.357787
f,1.083051,-1.377669
g,0.515035,0.515048


<h4>Selecting Rows</h4>

To select rows from a DataFrame we use the `.loc[]` indexer and pass the index label of the row we want to select. As in the case of columns, we get back a Series:

In [159]:
df.loc['f']

A    1.083051
B    1.053802
C   -1.377669
D   -0.937825
Name: f, dtype: float64

<h4> Selecting Multiple Rows</h4>

To select multiple rows, we pass a list of the index labels of the rows we want to select to the `.loc[]` indexer. In this case we will get back a DataFrame:

In [160]:
df.loc[['b','c','e','g']]

Unnamed: 0,A,B,C,D
b,0.473238,-0.072829,-0.846794,-1.514847
c,-0.446515,0.856399,0.214094,-1.245739
e,0.058209,-1.14297,0.357787,0.560785
g,0.515035,0.513786,0.515048,3.852731


We can also select rows by passing their integer position instead of the actual label. In this case we need to use the `.iloc[]` indexer.

For example we can select the same subframe as above in the following way:

In [180]:
df.iloc[[1,2,4,6]]

Unnamed: 0,A,B,C,D
b,0.473238,-0.072829,-0.846794,-1.514847
c,-0.446515,0.856399,0.214094,-1.245739
e,0.058209,-1.14297,0.357787,0.560785
g,0.515035,0.513786,0.515048,3.852731
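
The difference between label-based and position-based selection is easiest to see on a tiny example. The sketch below uses a hypothetical DataFrame `small` and also shows one point that often causes confusion: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.

```python
import pandas as pd

small = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
                     index=['a', 'b', 'c', 'd'])

by_label = small.loc[['b', 'd']]      # rows labelled 'b' and 'd'
by_position = small.iloc[[1, 3]]      # rows at positions 1 and 3
print(by_label.equals(by_position))   # True

# Slices differ: .loc includes the end label, .iloc excludes the end position
print(small.loc['a':'c'].shape[0])    # 3
print(small.iloc[0:2].shape[0])       # 2
```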


<hr style="height:2px;border:none;color:#333;background-color:#333;" />

<h2><font color='red'>Practice Exercise</font></h2>

Do the following tasks:
<ul>
    <li>Create a DataFrame with 5 columns and 11 rows, with customized columns and index labels</li>
    <li> Select columns 2,3, and 5 all at once</li>
    <li>Select rows 1,5,8, and 9 at once</li>
</ul>
<hr style="height:2px;border:none;color:#333;background-color:#333;" />

<h3>Creating New Columns</h3>

Often times we'll find ourselves needing to add one or more extra column(s) to our DataFrame. There are multiple ways of doing this depending on the type of situation we are in. We list some of them below:

<b> If we want to add the sum of two columns as a new column:</b>

In [162]:
df['A+C']=df['A']+df['C']

In [163]:
df

Unnamed: 0,A,B,C,D,A+C
a,-1.070892,0.482472,-0.223463,0.714,-1.294355
b,0.473238,-0.072829,-0.846794,-1.514847,-0.373556
c,-0.446515,0.856399,0.214094,-1.245739,-0.232421
d,0.173181,0.385317,-0.883857,0.153725,-0.710677
e,0.058209,-1.14297,0.357787,0.560785,0.415996
f,1.083051,1.053802,-1.377669,-0.937825,-0.294618
g,0.515035,0.513786,0.515048,3.852731,1.030083


We can also add a completely new column with any input we want, as long as it matches the length of the columns in our dataframe:

In [164]:
df['new']=randn(7)

In [165]:
df

Unnamed: 0,A,B,C,D,A+C,new
a,-1.070892,0.482472,-0.223463,0.714,-1.294355,0.570891
b,0.473238,-0.072829,-0.846794,-1.514847,-0.373556,1.135566
c,-0.446515,0.856399,0.214094,-1.245739,-0.232421,0.954002
d,0.173181,0.385317,-0.883857,0.153725,-0.710677,0.651391
e,0.058209,-1.14297,0.357787,0.560785,0.415996,-0.315269
f,1.083051,1.053802,-1.377669,-0.937825,-0.294618,0.758969
g,0.515035,0.513786,0.515048,3.852731,1.030083,-0.772825


Recall that `randn(7)` creates a 1D array with 7 entries.

Another way to add a new column, which is more customizable, is the following method, where we can specify the index labels where we want the data entered:

In [167]:
df.loc[df.index,'new2']=randn(7)

In [168]:
df

Unnamed: 0,A,B,C,D,A+C,new,new2
a,-1.070892,0.482472,-0.223463,0.714,-1.294355,0.570891,-0.471932
b,0.473238,-0.072829,-0.846794,-1.514847,-0.373556,1.135566,1.088951
c,-0.446515,0.856399,0.214094,-1.245739,-0.232421,0.954002,0.06428
d,0.173181,0.385317,-0.883857,0.153725,-0.710677,0.651391,-1.077745
e,0.058209,-1.14297,0.357787,0.560785,0.415996,-0.315269,-0.715304
f,1.083051,1.053802,-1.377669,-0.937825,-0.294618,0.758969,0.679598
g,0.515035,0.513786,0.515048,3.852731,1.030083,-0.772825,-0.730367


To appreciate the power of this second method, suppose that we only want to enter data on a few rows, but not all rows. For example, say we want to add a new column, call it `new3`, with entries only on rows `b, d, f`.

In [174]:
df.loc[['b','d','f'],'new3']=[10,20,30]

In [175]:
df

Unnamed: 0,A,B,C,D,A+C,new,new2,new3
a,-1.070892,0.482472,-0.223463,0.714,-1.294355,0.570891,-0.471932,
b,0.473238,-0.072829,-0.846794,-1.514847,-0.373556,1.135566,1.088951,10.0
c,-0.446515,0.856399,0.214094,-1.245739,-0.232421,0.954002,0.06428,
d,0.173181,0.385317,-0.883857,0.153725,-0.710677,0.651391,-1.077745,20.0
e,0.058209,-1.14297,0.357787,0.560785,0.415996,-0.315269,-0.715304,
f,1.083051,1.053802,-1.377669,-0.937825,-0.294618,0.758969,0.679598,30.0
g,0.515035,0.513786,0.515048,3.852731,1.030083,-0.772825,-0.730367,
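
The empty cells in the `new3` column above are missing values (`NaN`). A minimal sketch of the same pattern on a hypothetical three-row DataFrame `frame` shows how to confirm which rows were left unfilled.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# Only rows 'a' and 'c' receive a value; row 'b' is left as NaN
frame.loc[['a', 'c'], 'flag'] = [10, 30]

print(frame)
print(frame['flag'].isna().tolist())  # [False, True, False]
```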


<h3> Dropping Columns & Rows</h3>

<h4> Dropping Columns</h4>

To drop columns we may use the `.drop()` method. In this case, we also need to specify the `axis=1`, as by default it is `axis=0`, which refers to the index of the DataFrame. If we want the drop to be permanent, we need to also specify `inplace=True`, as by default it is set to `False`. Say we wanted to drop column `C`:

In [176]:
df.drop('C',axis=1)

Unnamed: 0,A,B,D,A+C,new,new2,new3
a,-1.070892,0.482472,0.714,-1.294355,0.570891,-0.471932,
b,0.473238,-0.072829,-1.514847,-0.373556,1.135566,1.088951,10.0
c,-0.446515,0.856399,-1.245739,-0.232421,0.954002,0.06428,
d,0.173181,0.385317,0.153725,-0.710677,0.651391,-1.077745,20.0
e,0.058209,-1.14297,0.560785,0.415996,-0.315269,-0.715304,
f,1.083051,1.053802,-0.937825,-0.294618,0.758969,0.679598,30.0
g,0.515035,0.513786,3.852731,1.030083,-0.772825,-0.730367,


Since we have not set `inplace=True` the drop is not permanent. In other words, if we call the DataFrame `df`, we will see that the column `C` is still there:

In [177]:
df

Unnamed: 0,A,B,C,D,A+C,new,new2,new3
a,-1.070892,0.482472,-0.223463,0.714,-1.294355,0.570891,-0.471932,
b,0.473238,-0.072829,-0.846794,-1.514847,-0.373556,1.135566,1.088951,10.0
c,-0.446515,0.856399,0.214094,-1.245739,-0.232421,0.954002,0.06428,
d,0.173181,0.385317,-0.883857,0.153725,-0.710677,0.651391,-1.077745,20.0
e,0.058209,-1.14297,0.357787,0.560785,0.415996,-0.315269,-0.715304,
f,1.083051,1.053802,-1.377669,-0.937825,-0.294618,0.758969,0.679598,30.0
g,0.515035,0.513786,0.515048,3.852731,1.030083,-0.772825,-0.730367,


Next, let's set `inplace=True` and check the DataFrame again:

In [178]:
df.drop('C',axis=1,inplace=True)

In [186]:
df

Unnamed: 0,A,B,D,A+C,new,new2,new3
a,-1.070892,0.482472,0.714,-1.294355,0.570891,-0.471932,
b,0.473238,-0.072829,-1.514847,-0.373556,1.135566,1.088951,10.0
c,-0.446515,0.856399,-1.245739,-0.232421,0.954002,0.06428,
d,0.173181,0.385317,0.153725,-0.710677,0.651391,-1.077745,20.0
e,0.058209,-1.14297,0.560785,0.415996,-0.315269,-0.715304,
f,1.083051,1.053802,-0.937825,-0.294618,0.758969,0.679598,30.0
g,0.515035,0.513786,3.852731,1.030083,-0.772825,-0.730367,


We may also drop multiple columns at the same time by simply passing on a list of the columns we want to drop. Say for example that we wanted to drop columns `A` and `new`, then we can do that as follows:

In [187]:
df.drop(['A','new'],axis=1)

Unnamed: 0,B,D,A+C,new2,new3
a,0.482472,0.714,-1.294355,-0.471932,
b,-0.072829,-1.514847,-0.373556,1.088951,10.0
c,0.856399,-1.245739,-0.232421,0.06428,
d,0.385317,0.153725,-0.710677,-1.077745,20.0
e,-1.14297,0.560785,0.415996,-0.715304,
f,1.053802,-0.937825,-0.294618,0.679598,30.0
g,0.513786,3.852731,1.030083,-0.730367,


<h4>Dropping Rows</h4>

Dropping rows is very similar to dropping columns. The only difference is that we need `axis=0`; this is the default, but we may pass it anyway to make the intent explicit. For example, say we wanted to drop row `d`, then we can do so as follows:

In [189]:
df.drop('d',axis=0)

Unnamed: 0,A,B,D,A+C,new,new2,new3
a,-1.070892,0.482472,0.714,-1.294355,0.570891,-0.471932,
b,0.473238,-0.072829,-1.514847,-0.373556,1.135566,1.088951,10.0
c,-0.446515,0.856399,-1.245739,-0.232421,0.954002,0.06428,
e,0.058209,-1.14297,0.560785,0.415996,-0.315269,-0.715304,
f,1.083051,1.053802,-0.937825,-0.294618,0.758969,0.679598,30.0
g,0.515035,0.513786,3.852731,1.030083,-0.772825,-0.730367,


Similarly to columns, if we don't specify `inplace=True` the drop is not permanent:

In [190]:
df

Unnamed: 0,A,B,D,A+C,new,new2,new3
a,-1.070892,0.482472,0.714,-1.294355,0.570891,-0.471932,
b,0.473238,-0.072829,-1.514847,-0.373556,1.135566,1.088951,10.0
c,-0.446515,0.856399,-1.245739,-0.232421,0.954002,0.06428,
d,0.173181,0.385317,0.153725,-0.710677,0.651391,-1.077745,20.0
e,0.058209,-1.14297,0.560785,0.415996,-0.315269,-0.715304,
f,1.083051,1.053802,-0.937825,-0.294618,0.758969,0.679598,30.0
g,0.515035,0.513786,3.852731,1.030083,-0.772825,-0.730367,


To make the drop of row `d` permanent, we can do as follows:

In [191]:
df.drop('d',axis=0,inplace=True)

In [192]:
df

Unnamed: 0,A,B,D,A+C,new,new2,new3
a,-1.070892,0.482472,0.714,-1.294355,0.570891,-0.471932,
b,0.473238,-0.072829,-1.514847,-0.373556,1.135566,1.088951,10.0
c,-0.446515,0.856399,-1.245739,-0.232421,0.954002,0.06428,
e,0.058209,-1.14297,0.560785,0.415996,-0.315269,-0.715304,
f,1.083051,1.053802,-0.937825,-0.294618,0.758969,0.679598,30.0
g,0.515035,0.513786,3.852731,1.030083,-0.772825,-0.730367,


We can also drop multiple rows at the same time in a similar fashion as in the columns case:

In [193]:
df.drop(['b','e','g'])

Unnamed: 0,A,B,D,A+C,new,new2,new3
a,-1.070892,0.482472,0.714,-1.294355,0.570891,-0.471932,
c,-0.446515,0.856399,-1.245739,-0.232421,0.954002,0.06428,
f,1.083051,1.053802,-0.937825,-0.294618,0.758969,0.679598,30.0
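
As an alternative to `inplace=True`, the result of `.drop()` can be assigned back to the variable. The sketch below, on a hypothetical DataFrame `frame`, also uses the `columns=` and `index=` keywords of `.drop()`, which state the axis by name instead of by number.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                     index=['a', 'b', 'c'])

frame = frame.drop(columns=['C'])   # same effect as drop('C', axis=1)
frame = frame.drop(index=['b'])     # same effect as drop('b', axis=0)

print(frame)
```

Reassignment keeps each step explicit and makes it easy to keep the original under a different name if it is still needed.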


<hr style="height:2px;border:none;color:#333;background-color:#333;" />

<h2><font color='red'>Practice Exercise</font></h2>

Using the same dataframe you created above, do the following:

<ul>
    <li> Add a new column called <b>new_col1</b> that consists of all ones</li>
    <li> Add a new row called <b> new_row</b> that consists of all integers</li>
    <li>Drop columns <b>one</b> and <b>three</b> permanently</li>
    <li> Drop rows <b>2, 5, 8</b> permanently</li>
 </ul>
 


In [199]:
#Start your solution here


<hr style="height:2px;border:none;color:#333;background-color:#333;" />

<h3> Selecting Subsets of Columns and Rows</h3>

Selecting subsets of columns and rows is very similar to the way we selected subarrays from NumPy arrays.

For example, if we want to select rows `a, c, e, g` and columns `A, D, new2, new3`, we may do so by using the `.loc[]` indexer and passing a list of the rows and a list of the columns we want to select, separated by a comma:

In [195]:
df.loc[['a','c','e','g'],['A','D','new2','new3']]

Unnamed: 0,A,D,new2,new3
a,-1.070892,0.714,-0.471932,
c,-0.446515,-1.245739,0.06428,
e,0.058209,0.560785,-0.715304,
g,0.515035,3.852731,-0.730367,


Similarly, we may access a single entry as well. For example we may access the entry in row `e` and column `D` as follows:

In [198]:
df.loc['e','D']

0.5607845263682344
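
For single entries, Pandas also provides the scalar accessors `.at` (by label) and `.iat` (by position). The following sketch uses a small hypothetical DataFrame `frame` rather than the lesson's `df`.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.5, 2.5], 'D': [3.5, 4.5]}, index=['e', 'f'])

print(frame.loc['e', 'D'])   # 3.5
print(frame.at['e', 'D'])    # 3.5, label-based scalar access
print(frame.iat[0, 1])       # 3.5, position-based scalar access
```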

<hr style="height:2px;border:none;color:#333;background-color:#333;" />

<h2>Conditional Selection</h2>

One of the most useful features of Pandas is the ability to perform conditional selection. This is akin to Boolean masking in NumPy.

For illustration let us take the following DataFrame:

In [205]:
df=pd.DataFrame(randn(13,4),columns=['A','B','C','D'])

In [208]:
df

Unnamed: 0,A,B,C,D
0,-0.471038,0.23205,-1.448084,-1.407464
1,-0.718444,-0.213447,0.310908,1.475356
2,0.85766,-0.159939,-0.019016,-1.002529
3,-0.018513,-0.288659,0.322719,-0.827231
4,0.519347,1.532739,-0.10876,0.401712
5,0.690144,-0.40122,0.224092,0.012592
6,0.097676,-0.77301,0.02451,0.497998
7,1.451144,0.959271,2.153182,-0.767348
8,0.872321,0.183342,2.189803,-0.808298
9,-0.839722,-0.599393,-2.123896,-0.525755


In [209]:
df>0

Unnamed: 0,A,B,C,D
0,False,True,False,False
1,False,False,True,True
2,True,False,False,False
3,False,False,True,False
4,True,True,False,True
5,True,False,True,True
6,True,False,True,True
7,True,True,True,False
8,True,True,True,False
9,False,False,False,False


As we can see from above, `df>0` returns a Boolean DataFrame with a value of `True` at entries where the condition is met and `False` otherwise. If we actually want to see the values that meet the condition, we may do the following:

In [210]:
df[df>0]

Unnamed: 0,A,B,C,D
0,,0.23205,,
1,,,0.310908,1.475356
2,0.85766,,,
3,,,0.322719,
4,0.519347,1.532739,,0.401712
5,0.690144,,0.224092,0.012592
6,0.097676,,0.02451,0.497998
7,1.451144,0.959271,2.153182,
8,0.872321,0.183342,2.189803,
9,,,,


In practice we will almost never use this. Instead, we will often be interested in selecting the subframe where a certain column or a subset of columns satisfies certain conditions.

For example, suppose we only care for the values where column `A` is positive. To select the subset of values that meet this condition we may do as follows:

In [211]:
df[df['A']>0]

Unnamed: 0,A,B,C,D
2,0.85766,-0.159939,-0.019016,-1.002529
4,0.519347,1.532739,-0.10876,0.401712
5,0.690144,-0.40122,0.224092,0.012592
6,0.097676,-0.77301,0.02451,0.497998
7,1.451144,0.959271,2.153182,-0.767348
8,0.872321,0.183342,2.189803,-0.808298
11,0.950424,-0.576904,-0.898415,0.491919


As we can see from above, this returns only the rows with positive values in column `A` and drops all other rows.
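
It can help to look at the condition on its own before using it to filter. In the sketch below, on a hypothetical DataFrame `frame`, the comparison produces a Boolean Series (named `mask` here) that is then passed inside the square brackets.

```python
import pandas as pd

frame = pd.DataFrame({'A': [-1.0, 2.0, 0.5, -3.0], 'B': [4, 5, 6, 7]})

mask = frame['A'] > 0             # Boolean Series aligned with frame's index
print(mask.tolist())              # [False, True, True, False]
print(mask.sum())                 # 2 rows satisfy the condition

positive_a = frame[mask]          # keeps only the rows where mask is True
print(positive_a.index.tolist())  # [1, 2]
```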

<hr style="height:2px;border:none;color:#333;background-color:#333;" />

<h2><font color='red'>Practice Exercise</font></h2>

Grab only columns `A` and `C` from Dataframe `df` such that the corresponding rows have negative `D` values

In [221]:
#Your answer goes here


<hr style="height:2px;border:none;color:#333;background-color:#333;" />

<h3> Multiple Conditions</h3>

We may also select portions of the dataframe that satisfy more than one condition. For example, if we wanted to select only the rows of the dataframe `df` that have positive values in column `A` <b>AND</b> negative values in column `C`, then we may do so by using the `&` (`AND`) operator. Each condition must be wrapped in parentheses:

In [223]:
df[(df['A']>0) & (df['C']<0)]

Unnamed: 0,A,B,C,D
2,0.85766,-0.159939,-0.019016,-1.002529
4,0.519347,1.532739,-0.10876,0.401712
11,0.950424,-0.576904,-0.898415,0.491919


On the other hand, if we wanted to select only the rows of the dataframe that have positive `A` values <b>OR</b> `C` values smaller than `-2`, we may do so by using the `|` (`OR`, `pipe`) operator:

In [233]:
df[(df['A']>0) |(df['C']<-2)]

Unnamed: 0,A,B,C,D
2,0.85766,-0.159939,-0.019016,-1.002529
4,0.519347,1.532739,-0.10876,0.401712
5,0.690144,-0.40122,0.224092,0.012592
6,0.097676,-0.77301,0.02451,0.497998
7,1.451144,0.959271,2.153182,-0.767348
8,0.872321,0.183342,2.189803,-0.808298
9,-0.839722,-0.599393,-2.123896,-0.525755
11,0.950424,-0.576904,-0.898415,0.491919

The same <b>AND</b> selection can also be written with the `.loc[]` indexer, which returns the same rows:


In [246]:
df.loc[(df['A']>0) & (df['C']<0)]

Unnamed: 0,A,B,C,D
2,0.85766,-0.159939,-0.019016,-1.002529
4,0.519347,1.532739,-0.10876,0.401712
11,0.950424,-0.576904,-0.898415,0.491919
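
A common stumbling block is using Python's `and`/`or` keywords in place of `&`/`|`. The sketch below, on a hypothetical DataFrame `frame`, shows that `and` raises a `ValueError` when applied to two Boolean Series, and that `~` negates a condition. The names `raised` and `not_positive_a` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, -2, 3], 'C': [-1, -5, 2]})

raised = False
try:
    # 'and' asks for the truth value of a whole Series, which is ambiguous
    frame[(frame['A'] > 0) and (frame['C'] < 0)]
except ValueError as err:
    raised = True
    print('ValueError:', err)

# '~' flips every True/False entry, selecting rows where A is NOT positive
not_positive_a = frame[~(frame['A'] > 0)]
print(not_positive_a.index.tolist())  # [1]
```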


Another useful kind of conditional selection tests whether values belong to a given set. For example, if we want to select only the rows whose values in a certain column fall in a given set of values, we can use the `.isin(values)` method.

Consider the following dataframe:

In [261]:
df1=pd.DataFrame(np.random.randint(1,12,size=(8,4)),columns=['X','Y','Z','W'])

In [262]:
df1

Unnamed: 0,X,Y,Z,W
0,7,4,11,10
1,1,11,6,9
2,11,8,5,11
3,2,3,6,4
4,1,11,10,6
5,10,7,11,9
6,8,11,10,8
7,3,4,4,8


In [266]:
df1[df1['X'].isin(np.array([7,2,30,9,3]))]

Unnamed: 0,X,Y,Z,W
0,7,4,11,10
3,2,3,6,4
7,3,4,4,8
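
The membership test can be negated with `~` to keep the rows whose value is not in the set. The sketch below rebuilds the `X` and `Y` columns of `df1` from the values displayed above, so that it does not depend on the random seed. The names `wanted`, `inside`, and `outside` are illustrative.

```python
import pandas as pd

# X and Y columns copied from the df1 output shown above
df1 = pd.DataFrame({'X': [7, 1, 11, 2, 1, 10, 8, 3],
                    'Y': [4, 11, 8, 3, 11, 7, 11, 4]})
wanted = [7, 2, 30, 9, 3]

inside = df1[df1['X'].isin(wanted)]     # X is one of the wanted values
outside = df1[~df1['X'].isin(wanted)]   # X is none of the wanted values

print(inside.index.tolist())   # [0, 3, 7]
print(outside.index.tolist())  # [1, 2, 4, 5, 6]
```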
