# Working with columns

### Introduction

In the last lesson, we learned about loading data into a pandas dataframe, and that we can think of a dataframe as a list of dictionaries, or as a list of lists.  We also saw that a pandas dataframe consists of rows, columns and an index.  

In this lesson, we'll learn more about how to select specific columns.  This is an important skill in machine learning.  Remember that even to fit our machine learning model, we had to separate our data into the features and the target.

```python
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(X, y)
```
And we'll also need to select columns if we wish to *exclude* certain columns in our feature dataframe `X`.  Alright, let's see how we can work with columns.
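As a rough sketch of that split, here is one way it could look on a small made-up dataframe. The name `toy_df`, its values, and the choice of `domgross` as the target are assumptions for illustration only; the individual selection steps are covered in the rest of this lesson.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset (values are made up)
toy_df = pd.DataFrame({
    'year': [2013, 2012, 2013],
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'budget': [100, 200, 300],
    'domgross': [150.0, 120.0, 500.0],
})

# Target: a single column, selected by name
y = toy_df['domgross']

# Features: keep only the columns we want the model to see
X = toy_df[['year', 'budget']]

# Another route to the same features: exclude the unwanted columns by name
X_alt = toy_df.drop(columns=['title', 'domgross'])

print(list(X.columns))
print(X.equals(X_alt))
```

Both routes give the same feature dataframe here. Listing the columns to keep is explicit, while `drop` can be handier when only a few columns need to be excluded.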

### Exploring Columns

Let's get started by loading up our data once again.

> Press shift + return on the cell below.

In [2]:
import pandas as pd
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv'
df = pd.read_csv(url)
df[:1]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period code,decade code
0,2013,tt1711425,21 & Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0,1.0,1.0


A good way to start exploring a dataframe is to view the names of all of its columns.  We can see them with the `columns` attribute.  Note that `columns` is an attribute rather than a method, so we do not call it with parentheses.

In [6]:
df.columns

Index(['year', 'imdb', 'title', 'test', 'clean_test', 'binary', 'budget',
       'domgross', 'intgross', 'code', 'budget_2013$', 'domgross_2013$',
       'intgross_2013$', 'period code', 'decade code'],
      dtype='object')
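Notice that the output above is a pandas `Index` rather than a plain Python list. A minimal sketch of a few things we can do with it, assuming a small made-up dataframe named `toy_df` (hypothetical, not the movies data):

```python
import pandas as pd

# Hypothetical one-row dataframe, used only to illustrate the Index object
toy_df = pd.DataFrame({'year': [2013], 'title': ['Movie A'], 'decade code': [1.0]})

cols = toy_df.columns        # a pandas Index holding the column names
col_list = cols.tolist()     # convert to a plain Python list

print(col_list)
print('year' in cols)        # check whether a column exists
print(len(cols))             # number of columns
```

Checking membership like this can be a quick way to confirm a column name before selecting it.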

### Selecting a single column

And, as we know, we can select a specific column by using the bracket accessors.

In [7]:
df['year']

0       2013
1       2012
2       2013
3       2013
4       2013
        ... 
1789    1971
1790    1971
1791    1971
1792    1971
1793    1970
Name: year, Length: 1794, dtype: int64

Another way to select a specific column is with the dot, followed by the column name.

In [8]:
df.year

0       2013
1       2012
2       2013
3       2013
4       2013
        ... 
1789    1971
1790    1971
1791    1971
1792    1971
1793    1970
Name: year, Length: 1794, dtype: int64

However, the dot notation only works when the column name is a valid Python identifier, so it cannot be used with some column names, like those with spaces.

In [9]:
df.decade code

SyntaxError: invalid syntax (<ipython-input-9-c600b1b3a395>, line 1)
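The bracket accessors do not have this limitation. The sketch below, again on a made-up `toy_df`, shows brackets handling a name with a space. It also shows a second case where the dot notation can mislead: a column whose name matches an existing dataframe method. The `count` column is a hypothetical example and is not in the movies data.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'year': [2013, 2012],
    'decade code': [1.0, 1.0],
    'count': [5, 7],   # hypothetical column sharing its name with DataFrame.count
})

# Brackets work for any column name, including names with spaces
decade = toy_df['decade code']

# For a simple name, brackets and dot notation return the same column
same = toy_df['year'].equals(toy_df.year)

# Here the dot notation gives us the DataFrame method, not the column
is_method = callable(toy_df.count)

# Brackets still return the column itself
count_col = toy_df['count']

print(same)
print(is_method)
```

For this reason, the bracket accessors are the safer default, with the dot notation as a shorthand for simple names.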

Let's practice by assigning the column `domgross` in `df` to the variable `y`.

In [1]:
y = None

> Below we'll do this for you.

> But don't look until you've tried.

In [17]:
y = df['domgross']
y

0       25682380.0
1       13414714.0
2       53107035.0
3       75612460.0
4       95020213.0
           ...    
1789    70327868.0
1790    10324441.0
1791    41158757.0
1792     4000000.0
1793     9000000.0
Name: domgross, Length: 1794, dtype: float64

### Selecting multiple columns

Now let's move on to selecting multiple columns.  We can start by again taking a look at our available columns.

In [10]:
df.columns

Index(['year', 'imdb', 'title', 'test', 'clean_test', 'binary', 'budget',
       'domgross', 'intgross', 'code', 'budget_2013$', 'domgross_2013$',
       'intgross_2013$', 'period code', 'decade code'],
      dtype='object')

Now let's select `year` and `title` as our columns.

In [12]:
columns = ['year', 'title']
selected_df = df[columns]
selected_df[:3]

Unnamed: 0,year,title
0,2013,21 & Over
1,2012,Dredd 3D
2,2013,12 Years a Slave


So we just greatly reduced the number of columns, and assigned this smaller dataframe to `selected_df`.  Let's go over how we did this.

We used the following format:

* dataframe, bracket accessors, list of columns 

```python
df[ ['col_1', 'col_2']]
```

It can be hard to keep track of all of those brackets, so it is nice to first assign the list of columns to a variable, and then pass this list through the bracket accessors.
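To see why the number of brackets matters, here is a small sketch on a made-up `toy_df` (hypothetical data). A string inside the brackets returns a single column as a Series, while a list inside the brackets returns a dataframe, even when the list holds only one name.

```python
import pandas as pd

toy_df = pd.DataFrame({'year': [2013, 2012], 'title': ['Movie A', 'Movie B']})

as_series = toy_df['year']        # string inside brackets -> Series
as_frame = toy_df[['year']]       # list inside brackets -> one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)

# The order of names in the list sets the order of the resulting columns
reordered = toy_df[['title', 'year']]
print(list(reordered.columns))
```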

In [14]:
columns = ['year', 'title']
selected_df = df[columns]

selected_df[:3]

Unnamed: 0,year,title
0,2013,21 & Over
1,2012,Dredd 3D
2,2013,12 Years a Slave


Now it's your turn.  Try selecting the `domgross` and `intgross` columns from `df`.  Assign the result to the variable `gross_cols`.

In [5]:
gross_cols = None

gross_cols[:3]

# 	domgross	intgross
# 0	25682380.0	42195766.0
# 1	13414714.0	40868994.0
# 2	53107035.0	158607035.0

Unnamed: 0,domgross,intgross
0,25682380.0,42195766.0
1,13414714.0,40868994.0
2,53107035.0,158607035.0


### Summary

In this lesson, we learned how to select columns from our pandas dataframe.  We can start by seeing all of the columns with the `columns` attribute.

In [23]:
df.columns

Index(['year', 'imdb', 'title', 'test', 'clean_test', 'binary', 'budget',
       'domgross', 'intgross', 'code', 'budget_2013$', 'domgross_2013$',
       'intgross_2013$', 'period code', 'decade code'],
      dtype='object')

We can select a single column by using either the bracket accessors or the dot notation, and then assign that column to a variable.

In [18]:
year = df['year']

In [19]:
year = df.year

We can select multiple columns by still using the bracket accessors, this time passing in a list of the columns that we would like to select.

In [3]:
cols = ['year', 'title']
selected = df[cols]
selected[:3]

Unnamed: 0,year,title
0,2013,21 & Over
1,2012,Dredd 3D
2,2013,12 Years a Slave

