# DS104 Data Wrangling and Visualization : Lesson One Companion Notebook

### Table of Contents <a class="anchor" id="DS104L1_toc"></a>

* [Table of Contents](#DS104L1_toc)
    * [Page 1 - Introduction ](#DS104L1_page_1)
    * [Page 2 - Adding Columns in R](#DS104L1_page_2)
    * [Page 3 - Adding Columns in Python](#DS104L1_page_3)
    * [Page 4 - Renaming Columns](#DS104L1_page_4)
    * [Page 5 - Renaming Columns in Python](#DS104L1_page_5)
    * [Page 6 - Combining Columns in R](#DS104L1_page_6)
    * [Page 7 - Separating Columns in R](#DS104L1_page_7)
    * [Page 8 - Combining Columns in Python](#DS104L1_page_8)
    * [Page 9 - Separating Columns in Python](#DS104L1_page_9)
    * [Page 10 - Subsetting Data in R](#DS104L1_page_10)
    * [Page 11 - Subsetting Data in Python](#DS104L1_page_11)
    * [Page 12 - Key Terms](#DS104L1_page_12)
    * [Page 13 - Hands On](#DS104L1_page_13)
    * [Page 14 - Hands On Practice - Solution](#DS104L1_page_14)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction <a class="anchor" id="DS104L1_page_1"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Manipulating Columns and Rows
VimeoVideo('241243071', width=720, height=480)

# Introduction 

This lesson marks the start of your data wrangling journey.  One of the best-kept secrets in data science is that you will spend most of your time wrangling data into the right format, not actually running analyses.  Until now, the data you have worked with has mostly been clean and ready for immediate use, so you'll now start to get a feel for the work that goes into preparing a dataset for analysis.  You'll start with the basics of data manipulation: learning to work with columns and rows.

By the end of this lesson, you should be able to: 

* Add new columns
* Rename columns
* Split columns up
* Combine columns together
* Subset your data to select only some columns and/or rows

This lesson will culminate with a Hands-On in which you will manipulate a dataset about fake news stories.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/429857151">recorded live workshop on the Python material in this lesson</a> or this <a href="https://vimeo.com/436301467">recorded live workshop on the R material in this lesson</a>.</p>
    </div>
</div>



In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Manipulating Columns and Rows
VimeoVideo('429857151', width=720, height=480)

In [3]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Manipulating Columns and Rows
VimeoVideo('436301467', width=720, height=480)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Adding Columns in R<a class="anchor" id="DS104L1_page_2"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


There will come a time in any data scientist’s life when you will need to add or remove columns and rows.  You may also need to take what’s in one column and make it two, or combine two columns into one. The formal term for smooshing columns together is *concatenation*. You'll be playing around with data manipulation using **[this dataset](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/babies.zip)**.

---
## Adding Columns in R

Adding a new column in R is relatively easy.  Specify the dataset and the name of the new column before the ```=```, and then assign whatever you want the column to contain.  In the example shown below, you are creating a new column named ```Footprint``` that looks blank, because the value between the double quotes is a single space.  You could instead put any character string in the quotes, or assign a number (not in quotes).  You could even fill that column conditionally, based on information contained in other columns, which is called *recoding*.  You will learn how to recode soon.

```{r}
babies$Footprint = " "
```

It’s important to note that this assignment changes the existing data frame directly; you cannot create a new column and a new dataset in the same step.  If you think you might need to revert a change later, it is a good idea to save a copy as a new data frame first, just in case.

---



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Adding Columns in Python<a class="anchor" id="DS104L1_page_3"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Adding Columns in Python

For the next few lessons, you'll need the Python package `pandas`, so make sure you run the following code if you're following along:

```python
import pandas as pd
```

Adding columns in Python is also a snap.  Simply call the data frame, and then place in square brackets the name of the new column, and provide the value after the equals sign, like so: 

```python
babies['Footprint'] = 'Y'
```

You have now created a column in this data set that indicates whether or not a baby has had his or her footprint taken yet, and filled every instance of this column with the value ```Y``` standing for “yes.” 
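
As a minimal sketch of the same idea, the block below builds a small stand-in data frame (the names `babies_demo`, `Visits`, and `Ward` are hypothetical and not part of the lesson's dataset) and shows that a single value is repeated down every row, while a list supplies one value per row.

```python
import pandas as pd

# Hypothetical three-row stand-in for the babies data
babies_demo = pd.DataFrame({
    "Name": ["Potter", "McClure", "Farley"],
    "Weight": [1, 7, 6],
})

# A single string is repeated in every row of the new column
babies_demo["Footprint"] = "Y"

# A number works the same way (no quotes needed)
babies_demo["Visits"] = 0

# A list supplies one value per row; its length must match the number of rows
babies_demo["Ward"] = ["A", "B", "A"]

print(babies_demo)
```

If the list were shorter or longer than the data frame, pandas would raise an error rather than guess how to fill the column.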

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Renaming Columns<a class="anchor" id="DS104L1_page_4"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Renaming Columns

Column names can also be changed in both R and Python with little fuss.  Although it’s always nice to be able to rename columns to something that is meaningful for you to work with, it becomes especially important if the source of your data allows spaces in the header row.  For instance, in MS Excel, you are allowed to have headers at the top of your data with spaces in them.  However, R and Python tend to throw errors when you try to call columns that have spaces embedded in their names.  In most cases R will default to replacing each space with a period when it reads the data in, but Python will not.  That makes renaming columns particularly essential!

 <div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You may be thinking to yourself, "Isn't it so much easier to just remove the spaces in MS Excel than hard-coding it in R or Python?" The answer is probably "yes," but there will come a time when your data is so big that you can't even open MS Excel without it crashing.  So it's imperative you know how to do it in another program as well.</p>
    </div>
</div>
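
One way to strip spaces from every header at once in Python is sketched below. It assumes a tiny made-up data frame (`raw`, with two invented headers) and uses the `.rename()` method with a function that is applied to each column name; treat it as an illustration rather than the only approach.

```python
import pandas as pd

# Hypothetical data whose headers contain spaces, as an Excel export might
raw = pd.DataFrame({
    "Parent Phone Number": ["555-0100"],
    "Street Address": ["6354 Sed St."],
})

# Square brackets and quotes still work when a name contains spaces
street = raw["Street Address"]

# Remove the spaces from every column name in one pass
cleaned = raw.rename(columns=lambda name: name.replace(" ", ""))

print(list(cleaned.columns))
```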

---

## Renaming Columns in R 

Below is the code to rename a column in R.  First you call the ```names``` function and specify the dataset whose column you want to rename.  Then, in square brackets, you again specify ```names``` and the dataset, followed by a double equals sign and the name of the original variable in double quotes.  Lastly, after the ```<-```, you will place your new name for the column in double quotes.

```{r}
names(babies)[names(babies) == "ParentPhoneNumber"] <- "Phone"
```

The code above will use the ```names``` function to rename the column ```ParentPhoneNumber``` to ```Phone``` in the ```babies``` dataset. 

---

## Choosing Column Names

Make sure to choose column names with care.  Good column names make the data analysis process go much more smoothly, since it is easy to tell what data is contained within them and it does not take long to reference the columns.  Poor column name choices can increase the complexity of your data analysis by confusing you and others, both through a poor fit between name and data and through an increased chance of typos.  A good column name will be short, succinct, and easily understood by someone who does not work with the data regularly.  It will also be easily read – so make sure that column names with more than one word are either in *camel case* (EveryFirstLetterCapitalized), or have some other delineation like periods (periods.between.words) or underscores (underscore_between_words).  Using some typical naming conventions will save you many headaches later on.

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Renaming Columns in Python<a class="anchor" id="DS104L1_page_5"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Renaming Columns in Python

When compared to R, renaming columns in Python has slightly simpler syntax, using the ```.rename``` method.  Specify the data frame, then call ```.rename```.  The arguments you'll include are ```columns=```, which takes a dictionary of key-value pairs in the form ```{'OldValue' : 'NewValue'}```, and ```inplace=True```, which applies the change directly to your data frame instead of returning a renamed copy. 

```python
babies.rename(columns={'ParentPhoneNumber' : 'Phone'}, inplace=True)
```

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
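        <p>In Python, a period is used for attribute access, so a column whose name contains periods cannot be called with dot notation; it can only be reached with square brackets and quotes.  This naming convention will not convert cleanly between programs!</p>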
    </div>
</div>
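
The sketch below, which uses a hypothetical one-row data frame, shows two details of ```.rename``` that are easy to miss: without ```inplace=True``` the original data frame is left alone and a renamed copy is returned, and several columns can be renamed in one call by adding more key-value pairs.

```python
import pandas as pd

# Hypothetical one-row stand-in for the babies data
babies_demo = pd.DataFrame({
    "Name": ["Wall"],
    "ParentPhoneNumber": ["555-0100"],
    "ParentEmail": ["parent@example.com"],
})

# Without inplace=True, .rename returns a new data frame
renamed = babies_demo.rename(
    columns={"ParentPhoneNumber": "Phone", "ParentEmail": "Email"}
)

print(list(babies_demo.columns))  # original names are still here
print(list(renamed.columns))      # new names live in the copy
```

Assigning the result to a new name, as done here, keeps the original around in case you need to revert.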

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Combining Columns in R<a class="anchor" id="DS104L1_page_6"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Combining Columns in R

The opposite of splitting columns is putting those columns back together again.  For instance, first and last names may be stored separately in your database for ease of use and analysis (sorting by last name, anyone?), but when you actually want to provide a data printout to a customer, such as a list of names and email addresses to contact, you may want it in a format that is easier to read.  In R, this function is called ```unite```, and it is contained within the `tidyr` package: 

```{r}
install.packages("tidyr")
library("tidyr")
```

The code is shown below.  You specify the name of the new dataset before the arrow, then call the ```unite``` function, type in the name of the original dataset, type in the name of the new column that will contain the information from the current columns, and then specify the columns that you want to combine and how you want the data to be combined. Take a look: 

```{r}
babies2 <- unite(babies1, Address, StreetAddress, City, Zipcode, sep = "/")
```

So the function above will create a new column named ```Address``` that will be made from the three columns ```StreetAddress```, ```City```, and ```Zipcode```.  The argument ```sep=``` specifies the character placed between the pieces. The separator in this case is a forward slash.

You can choose to mush everything together with no separators, but typically for readability you might want to add a space, comma, or other separator.  Adding a separator also means that you can take them apart again easily later if you need to. 

It is very important to choose a separator that will not occur naturally in your data.  For instance, if you look at the ```StreetAddress``` column, you will find that dashes, periods, commas, and even hash signs all appear within that column.  If you had chosen one of those as the separator rather than the forward slash, it would mess with your data when you later split the column apart again!  R starts a new column every time it sees the separator, so if the separator shows up earlier or more often than expected, your columns won’t all have the same data in them.  You will know you’ve made this mistake if you see this warning when separating: 

![A warning message reads, expected three pieces. Additional pieces discarded in 37 rows open square bracket 2, 7, 9, 11, 12, 13, 15, 16, 17, 19, 20, 21, 28, 36, 38, 42, 43, 48, 52, 53, … close square bracket.](Media/104.L1.45.png)

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>This problem will generate a warning, but will STILL RUN.  So always make sure you examine your columns carefully to make sure things went as expected.</p>
    </div>
</div>
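
Before committing to a separator, it can help to check whether it already occurs in the data. The following is a rough sketch of such a check in Python with pandas; the three street addresses are modeled on the sample data, and `candidates`, `conflicts`, and `safe` are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Street addresses modeled on the sample data
street = pd.Series([
    "229-8390 Dignissim. Road",
    "PO Box 593, 7892 Nunc St.",
    "6354 Sed St.",
])

# Separators you might be considering
candidates = ["/", ",", ".", "-"]

# For each candidate, count the rows that already contain it
# (regex=False treats "." as a literal period, not a wildcard)
conflicts = {
    sep: int(street.str.contains(sep, regex=False).sum())
    for sep in candidates
}

# Only separators that never occur in the data are safe to use
safe = [sep for sep, count in conflicts.items() if count == 0]

print(conflicts)
print(safe)
```

In this small sample only the forward slash is free of conflicts, which matches the choice made in the lesson.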

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Separating Columns in R<a class="anchor" id="DS104L1_page_7"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Separating Columns in R

Typically, if you are transferring data from a file into R, even if all the data is stored in one giant chunk, you can delimit it upon import.  However, there may be times when you still need to break columns apart.  The way you do this in R is with the ```separate()``` function, which is also part of the ```tidyr``` package.  It allows you to go from data like this, where the street address, city, and zip code are all in one ```Address``` column… 

![A table has seven columns with eleven entries. The column headings are serial number, full name, birthday, address, weight, height, and hospital ID. The row entries are as follows. Row 1, 1, Potter Alec J., 2017-11-06, 229-8390 Dignissim. Road/Ucluelet/36170, 1, 9, 1.672053 e plus 12. Row 2, 2, McClure Lucius S., 2018-08-14, RO Box 593, 7892 Nunc St./Licanten/47279. Row 3, 3, Farley Hilel Y., 2017-12-07, 3859 Tellus Rd,/Alva/78490, 6, 5, 1.664012 e plus 12. Row 4, 4, Cukmings Randall W, 2018-03-07, Ap number 293-6684 Lobortis Street/Vicuria/36395, 9, 9, 1.669073 e plus 12. Row 5, 5, LesterBruce I., 2018-01-14, 495-3192 Dictum St./Bonnert/25130, 10, 15, 1.603101 e plus 12. Row 6, 6, Wall Simone D, 2018-03-21, 6354 Sed St./Miraj/91893, 15, 17, 1.623102 e plus 12. Row 7, 7, Henson Hayden J, 2018-05-22, 942-3292 Luctus, Avenue/Windermere/85577, 14, 11, 1.644103 e plus 12. Row 8, 8, Suarez Mia I, 2019-04-08, Ap number 118-8911 Quisque Ave/Lampeter/20470, 14, 19, 1.663093 e plus 12, Row 9, 9, Allison Troy C, 2018-10-22, PO Box 251, 1889 Sem St./Bousval/91658, 15, 3, 1.646122 e plus 12. Row 10, 10 Watson Abraham S, 2018-04-25, Ap number 205-6099 Ac Avenue/Kleinmachnow/70319, 9, 16, 1.634072 e plus 12. Row 11, 11, Carr Preston HL, 2017-11-10, PO Box 293, 9756 Ut Ave/Rajahmundry/58093, 9, 12, 1.624051 e plus 12.](Media/104.L1.41.png)

…to data like this, where those pieces have been split into separate columns. 

![A table has nine columns with eleven-row entries. The column headings are Last name, first name, middle initial, birthday, street address, city, zip code, and weight. Row 1, 1, Potter, Alec, J, 2017-11-06, 229-8390 Dignissim road, Ucluelet, 36170, blank. Row 2, 2, McClure, Lucius, S, 2018-08-14, PO Box 593, 7892 Nunc St., Licanten, 47279, blank. Row 3, 3, Farley, Hilel, Y, 2017-12-07, 3859 Tellus rd, Alva, 78490, blank. Row 4, 4, Cummings, Randall, W, 2018-03-07, Ap number 293-6684 Lobortis street, Vicuna, 36395, blank. Row 5, 5, Lester, Bruce, L, 2018-01-14, 495-3192 Dictum St., Bonnert, 25130, blank. Row 6, 6, Wall, Simone, D, 2018-03-21, 6354 Sed St., Miraj, 91893, blank. Row 7, 7, Henson, Hayden, J, 2018-05-22, 942-3292 Luctus, Avenue, Windermere, 5577, blank. Row 8, 8, Suarez, Mia, I, 2019-04-08, Ap number 118-8911 Quisque Ave, Lampeter, 20470. Row 9, 9, Allison, Troy, C, 2018-10-22, PO box 251, 1889 Sem St., Bousval, 91658, blank. Row 10, 10, Watson, Abraham, S, 2018-04-25, Ap number 205-6099 Ac Avenue, Kleinmachnow, 70319, blank. Row 11, 11, Carr Preston, H, 2017-11-10, PO Box 293, 9756 Ut Ave, Rajahmundry, 58093, blank.](Media/104.L1.42.png)

Here are some common examples of when you might separate columns: 

  * First and Last name are stored together, but you need to use them separately.
  * Addresses are stored as we would write them out, not separated into street address, city, state, and zipcode. 
  * Inches and feet for height are stored together.
  * Month, Day, and Year are stored together, but you’d like to look at only month or year. 

Here is the code you will use to separate columns in R for the ```Address``` column: 

```{r}
babies1 <- separate(babies, Address, c("StreetAddress", "City", "Zipcode"), sep="/")
```

After you call the ```separate()``` function, you will then put in the name of the current data set, followed by the name of the column you are breaking apart.  You will place the names of the new columns you would like to create from the original column in a vector, denoted by the ```c()```. Each new column to be created should be in quotes and separated by a comma. Lastly, you will provide the argument ```sep=```, which is for specifying your separator, or the way in which you will break up the columns. This could be a character like a ```/``` or ```,```, or it could be a single space inside the quotes.  Whenever R finds the thing placed within the quotes, it will make a new column, and those chunks of information will be placed, in order, into the new columns that you specified. In the case of the code above, you are splitting out the ```Address``` column into three columns, separating at the forward slash, and those three new columns will be named ```StreetAddress```, ```City```, and ```Zipcode```.

---



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Combining Columns in Python<a class="anchor" id="DS104L1_page_8"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Combining Columns in Python

The Python syntax for combining columns reads much more like a sentence, which makes it rather user friendly.  First, specify the data frame and the name of the new column you'll create; then, after the equals sign, place the first column you would like to combine, the separator, and the remaining columns you would like to add.  Adding ```.map(str)``` to the end of a column converts its values to strings (characters) if they were not already. 

```python
babies['FullName'] = babies["Name"] + " " + babies["First"].map(str)
```

Unlike ```unite``` in R, which by default drops the columns it combines, Python creates a new column with the concatenation and leaves the old columns untouched, which is a nice feature.

This expression becomes slightly lengthier if you are concatenating more than two columns; however, it still reads with the logic of a sentence: 

```python
babies['Address'] = babies["Street Address"] + " / " + babies["City"] + " / " + babies["Zipcode"].map(str) 
```

The above code combines three columns into the new column ```Address```, and between each part there is a separator of a forward slash with a space on either side. 
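
To see why ```.map(str)``` matters, here is a small sketch under the assumption that ```Zipcode``` is stored as integers. The data frame and its values are made up for illustration, and the column is spelled ```StreetAddress``` without a space here.

```python
import pandas as pd

# Hypothetical stand-in data; Zipcode is deliberately numeric
babies_demo = pd.DataFrame({
    "StreetAddress": ["6354 Sed St.", "3859 Tellus Rd."],
    "City": ["Miraj", "Alva"],
    "Zipcode": [91893, 78490],
})

# Text columns can be joined directly with +,
# but the numeric column must be converted to strings first
babies_demo["Address"] = (
    babies_demo["StreetAddress"]
    + "/"
    + babies_demo["City"]
    + "/"
    + babies_demo["Zipcode"].map(str)
)

print(babies_demo["Address"].tolist())
```

Leaving off the conversion would typically fail with a type error, because text and numbers cannot be added together. Using a bare ```/``` with no surrounding spaces also means the pieces come back clean if the column is split again later.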

---



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Separating Columns in Python <a class="anchor" id="DS104L1_page_9"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Separating Columns in Python

Unfortunately, separating columns in Python can be a bit unwieldy.  Although it is quite easy to actually separate them, the new separated columns are automatically placed inside a new dataframe instead of the one that contains the original column.  This means that in addition to separating the column, you also need to add into your current data frame the new separated columns.

---

## Splitting the Columns

The first step is to separate the column.  As shown below, you will create a new data frame, which will then contain your columns after you’ve split them.  You call the ```.str.split()``` method on the column, and then specify what your data is broken up by (your separator).  In this case, it is a forward slash.  The ```expand=True``` argument is very important – it makes each separated section its own column in a data frame rather than just producing a list.  

```python
babies1 = babies['Address'].str.split('/', expand=True)
```

This code separates the ```Address``` column based on the forward slash; however, you will notice that the columns are not labeled at all, just zero indexed. 

![Snapshot of a window reads In open square bracket 31 close square bracket babies1.head open and close brackets. Out open square bracket 31 close square bracket. A table with three columns labeled 0, 1, and 2. The row entries are as follows. Row 1, 0, 229-8390 Dignissim Road, Ucluelet, 36170. Row 2, 1, PO box 593, 7892 Nunc St., Licanten, 47279. Row 3, 2, 3859 Tellus rd, Alva, 78490. Row 4, 3, Ap number 293-6684 Lobortis street, Vicuna, 36395. Row 5, 4, 495-3192 Dictum St., Bonnert, 25130.](Media/104.L1.53.png)
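
The difference that ```expand=True``` makes can be sketched with two made-up addresses. The variable names `addresses`, `as_lists`, and `as_columns` are hypothetical.

```python
import pandas as pd

# Two hypothetical combined addresses
addresses = pd.Series(
    ["6354 Sed St./Miraj/91893", "3859 Tellus Rd./Alva/78490"],
    name="Address",
)

# Without expand=True, each row holds a single Python list
as_lists = addresses.str.split("/")

# With expand=True, each piece gets its own column, labeled 0, 1, 2
as_columns = addresses.str.split("/", expand=True)

print(as_lists.iloc[0])
print(as_columns)
```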

---

## Renaming the Columns

The easy fix for that is to add in a ```.rename()``` function to the whole shebang like this:

```python
babies2 = babies['Address'].str.split('/', expand=True).rename(columns = lambda x: "Address"+str(x+1))
```

In the ```.rename()``` function, the argument ```columns=``` says that you want to rename columns, and the ```lambda x``` function is applied to each existing column label (```0```, ```1```, and ```2```): it repeats the trunk of ```Address``` every time and adds the label plus one onto the end, giving ```Address1```, ```Address2```, and ```Address3```.  You can leave them like this, or rename the columns like you learned above – it’s a personal preference. 

![Snapshot of a window reads In open square bracket 29 close square bracket babies1.head open and close brackets. Out open square bracket 29 close square bracket. A table with three columns labeled Address 1, Address 2, and Address 3. The row entries are as follows. Row 1, 0, 229-8390 Dignissim Road, Ucluelet, 36170. Row 2, 1, PO box 593, 7892 Nunc St., Licanten, 47279. Row 3, 2, 3859 Tellus rd, Alva, 78490. Row 4, 3, Ap number 293-6684 Lobortis street, Vicuna, 36395. Row 5, 4, 495-3192 Dictum St., Bonnert, 25130.](Media/104.L1.52.png)
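
Both naming options can be compared side by side in a short sketch. The sample addresses are made up, and assigning a list to ```.columns``` is shown as one possible alternative to numbered names, not as the lesson's prescribed method.

```python
import pandas as pd

# Two hypothetical combined addresses, split into unlabeled columns
addresses = pd.Series(
    ["6354 Sed St./Miraj/91893", "3859 Tellus Rd./Alva/78490"],
    name="Address",
)
parts = addresses.str.split("/", expand=True)

# Option 1: numbered names built from the labels 0, 1, 2
numbered = parts.rename(columns=lambda x: "Address" + str(x + 1))

# Option 2: descriptive names, one per column, in order
named = parts.copy()
named.columns = ["StreetAddress", "City", "Zipcode"]

print(list(numbered.columns))
print(list(named.columns))
```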

---

## Adding the Columns Back In

The next step is to add those columns back into your dataset.  To do this, you will append your data side by side. In Python, the ```pandas``` package has a function called ```pd.concat()``` that will add in columns or rows.  All you need to do is list, in square brackets, the datasets that you want to place side by side, in the order in which you want to see them, and then specify ```axis=1``` to tell Python that you are adding columns, not rows. 

```python
babies3 = pd.concat([babies, babies2], axis=1)
```

![Snapshot of a window that reads In open square bracket 38 close square bracket babies2 equals pd.concat open bracket open square bracket babies, babies1 close square bracket, axis equals 1 close bracket. In open square bracket 39 close square bracket babies2.head open and close brackets. Out open square bracket 39 close square bracket. A table with eleven columns and four rows. The column headings labeled ress, city, zip code, weight, height, hospital Id, parent phone number, parent email, full name, address, 0.](Media/104.L1.54.png)
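
Putting the three steps together, a minimal end-to-end sketch might look like the following. It assumes a made-up two-row data frame, and the descriptive column names are an illustrative choice.

```python
import pandas as pd

# Hypothetical stand-in data with a combined Address column
babies_demo = pd.DataFrame({
    "Name": ["Wall", "Farley"],
    "Address": ["6354 Sed St./Miraj/91893", "3859 Tellus Rd./Alva/78490"],
})

# Step 1: split the column into a separate data frame
parts = babies_demo["Address"].str.split("/", expand=True)

# Step 2: give the new columns readable names
parts.columns = ["StreetAddress", "City", "Zipcode"]

# Step 3: place the two data frames side by side
combined = pd.concat([babies_demo, parts], axis=1)

print(combined)
```

```pd.concat``` lines rows up by their index labels, so this works cleanly here because `parts` was created from `babies_demo` and shares its index.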

---






<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Subsetting Data in R<a class="anchor" id="DS104L1_page_10"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Subsetting Data in R

*Subsetting* is when you take a portion of your old data set and turn it into its own new dataset.  It's also a way to drop columns or rows you don't need. When you subset, you can choose which columns and which rows you’d like to take with you into the new dataset.

---

## Subsetting Using Indexes

In R, subsetting data is this easy: 

```{r}
babies6 <- babies[1:5, 1:3]
```

This will keep only the first five rows of data and the first three columns.  You will always specify the rows first, with the index of the starting row, followed by a colon and the index of the ending row.  Then you'll add a comma and put in the information for the columns.  The first number after the comma is the index of the starting column, and the second number is the index of your ending column.

Said another way, the first set of numbers is for the rows you want to keep.  The format is ```first : last```.  The second set of numbers is for the columns you want to keep.  Again, the format is ```first : last```.

---

## Subsetting Using Column Names

Alternatively, when dealing with columns, you can specify the names of the columns you want to keep.  This way, you don't need to worry about whether they are adjacent to each other or not.  You'll create a new vector, in this case called ```keeps```, that has the names of the columns you want to retain. Then, you'll apply that vector to your dataset, placing it in square brackets, like this: 

```{r}
keeps <- c("Name", "Birthday", "ParentEmail")
babies7 <- babies[keeps]
```

The resulting dataset looks like this:

![A table has three columns with fourteen row entries. The column headings are name, birthday, parent email. Row 1, 1, Potter, 2017-11-06, ut at nisiAenean.ca. Row 2, 2, McClure, 2018-08-14, Donec.Juctus at Maecenas.edu. Row 3, 3, Farley, 2017-12-07, Aenean at imperdietnonvestibulum.ca. Row 4, 4, Cummings, 2018-03-07, velit.eu.sem at Aliquam.com. Row 5, 5, Lester, 2018-01-14,dolor.Donec at miAliquamgravida.com. Row 6, 6, Wall, 2018-03-21, Aenean.eget.magna at acfermentum.net. Row 7, 7, Henson,nec.mollis.vitae at idsapien.com. Row 8, 8, Suarez, 2019-04-08, sit at pellentesque.com. Row 9, 9, Allison, 2018-10-22, integer at necdiam.ca. Row 10, 10, Watson, 2018-04-25,mauris.ut.mi at ametmass.ca Row 11, 11, Carr, 2017-11-10, Nunc.mauris at Aenean.edu. Row 12, 12, Delacruz, 2018-03-23, sollicitudin.orci.sem at MorbivehiculaPellentesque.ca. Row 14, 14, Walton, 2018-07-21, Nullam.nisl.Maecenas at CurabiturmassaVestibulum.com.](Media/104.L1.60.png)

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Subsetting Data in Python<a class="anchor" id="DS104L1_page_11"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Subsetting Data in Python

You will now learn how to subset your data in Python.

---

## Subsetting Rows

A very similar subset command can be used in Python to limit the number of rows you have.  This particular code takes only the first three rows.  Remember that Python uses zero indexing, so the three rows kept are those at positions ```0```, ```1```, and ```2```; the index of the last row will actually read ```2```. 

```python
babies7 = babies[:3]
```
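
For comparison with the R example of selecting rows and columns by position, the sketch below uses ```.iloc```, the pandas indexer for position-based selection. The data frame is a made-up stand-in for illustration.

```python
import pandas as pd

# Hypothetical five-row stand-in for the babies data
babies_demo = pd.DataFrame({
    "Name": ["Potter", "McClure", "Farley", "Cummings", "Lester"],
    "Birthday": ["2017-11-06", "2018-08-14", "2017-12-07", "2018-03-07", "2018-01-14"],
    "Weight": [1, 7, 6, 9, 10],
})

# The first three rows: positions 0, 1, and 2 (the end of a slice is excluded)
first_three = babies_demo[:3]

# The same rows, written with the position-based indexer
same_rows = babies_demo.iloc[:3]

# Rows at positions 1 to 3 and the first two columns, rows first and then columns
block = babies_demo.iloc[1:4, 0:2]

print(block)
```

Unlike R, where ```1:5``` includes both ends, a Python slice stops just before its second number.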

---

## Keeping Columns

For columns, you can select by column names, placing them in double square brackets:

```python
babies8 = babies[['Name', 'Birthday', 'ParentEmail']]
```

This new dataset, ```babies8```, will only be left with the columns of ```Name```, ```Birthday```, and ```ParentEmail```.
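
The double square brackets matter. As a brief sketch with made-up data: the outer pair does the selecting and the inner pair is a list of names, so the result is a data frame even when only one name is listed, and the columns come back in the order you list them.

```python
import pandas as pd

# Hypothetical one-row stand-in for the babies data
babies_demo = pd.DataFrame({
    "Name": ["Wall"],
    "Birthday": ["2018-03-21"],
    "ParentEmail": ["parent@example.com"],
})

# Single brackets with one name return a Series (one column of values)
one_series = babies_demo["Name"]

# Double brackets return a DataFrame, even for a single column
one_frame = babies_demo[["Name"]]

# Columns appear in the order given in the list
reordered = babies_demo[["ParentEmail", "Name"]]

print(type(one_series).__name__, type(one_frame).__name__)
print(list(reordered.columns))
```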

---

## Dropping Columns

You can also just drop columns, if you are keeping most of the columns and only getting rid of one or two. In the code below, you will be dropping ```ParentPhone``` out of the ```babies``` dataset using the function ```.drop```: 

```python
babies.drop(['ParentPhone'], axis=1)
```

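As with ```pd.concat()```, the ```axis=1``` argument tells Python to apply this operation to the columns, not the rows.  Note that ```.drop``` returns a new data frame and leaves ```babies``` itself unchanged, so assign the result to a variable if you want to keep it.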
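
A short sketch with a hypothetical one-row data frame shows what ```.drop``` does and does not change. The ```columns=``` keyword used in the second call is an equivalent spelling of ```axis=1```.

```python
import pandas as pd

# Hypothetical one-row stand-in for the babies data
babies_demo = pd.DataFrame({
    "Name": ["Wall"],
    "ParentPhone": ["555-0100"],
    "ParentEmail": ["parent@example.com"],
})

# Capture the result; babies_demo itself still has all three columns
trimmed = babies_demo.drop(["ParentPhone"], axis=1)

# Equivalent call that names the columns directly
same = babies_demo.drop(columns=["ParentPhone"])

print(list(babies_demo.columns))
print(list(trimmed.columns))
```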

---

## Summary

In this lesson, you got started with the fundamentals of data wrangling! You learned how to manipulate columns and rows in both Python and R, and you should now be able to complete the following tasks: 

* Adding columns
* Renaming columns
* Combining columns
* Separating columns
* Subsetting columns and rows

You will need these skills in order to tame your raw data and turn it into something useful! 

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - Key Terms<a class="anchor" id="DS104L1_page_12"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms 

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Concatenation</td>
        <td>Combining columns together.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Separator</td>
        <td>A punctuation mark that can be used to break up your data into additional columns; often but not always a comma.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Measures</td>
        <td>What Tableau calls continuous variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Dimensions</td>
        <td>What Tableau calls categorical variables.</td>
    </tr>
</table>

---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>names</td>
        <td>A function used to rename columns.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>separate()</td>
        <td>A function in the tidyr library to split up columns by a separator.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sep=</td>
        <td>An argument to separate() and unite() that specifies the separator within double quotes.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>unite()</td>
        <td>A function in the tidyr library to combine columns together.</td>
    </tr>
</table>

---

## Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>tidyr</td>
        <td>A package used for data manipulation and wrangling.</td>
    </tr>
</table>



---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.rename()</td>
        <td>A function to rename columns.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>inplace=True</td>
        <td>An argument to .rename() that changes the names permanently.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.str.split()</td>
        <td>A function that will split columns based on a separator.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>expand=True</td>
        <td>An argument for str.split() that will ensure each separated section becomes its own column, rather than just being part of a list.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pd.concat()</td>
        <td>A function in pandas that adds two dataframes back together.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.drop()</td>
        <td>A function where the specified columns are removed from the dataset.</td>
    </tr>
</table>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 13 - Hands On<a class="anchor" id="DS104L1_page_13"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Jupyter Notebook and Pandas Hands-On

This Hands-On will be graded.  The best way to become a data scientist is to practice!

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

You are working for an ecology company, and they have been tracking bison throughout North America. They've  collected **[data on the location, number, genus, and species of bison](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/BisonTracking.zip)**. They'd like to know some basic information about the bison, to determine whether the species is still in danger or whether it is recovering.  

Please perform the following tasks: 

* Read in your data as a CSV file
* Look at the first seven rows of your data
* Look at the last ten rows of your data
* Determine the number of rows and columns your dataset has 


And answer the following questions: 

* How many bison are of the species antiquus? 
* What are the mean and standard deviation of Length? 
* What is the median length of the bison?  

Please annotate your code with markdown to explain each step, then attach your ipynb or an HTML copy of your notebook here, so your work can be graded. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 14 - Hands On Practice - Solution<a class="anchor" id="DS104L1_page_14"></a>

[Back to Top](#DS104L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Fake News Stories

For your Lesson 1 Practice Hands-On, you will be completing the following requirements. This Hands-On will **not** be graded, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Please submit your actual R script / Python notebook.  

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---
## Requirements

Here is a dataset on fake news stories: **[Fake News Dataset](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/FakeNews.zip)**. You will be practicing your column and row manipulations with it. 

---
### Part 1: Please complete the following tasks in R:

1. Add a column labeled ```StoryType``` and fill it with ```Fake```.
2. Remove the ```when``` column.
3. Separate the ```url``` column out so that you can see in one column the website and in the second column the domain.  For example, if you have the following in ```url```, it should be broken out like this: 

   http://wayback.archive.org/web/20161004072420id_/http://alertchild.com/

   Website: http://wayback.archive.org/web/20161004072420id
   Domain: /http://alertchild.com/ 

4. Put the columns you separated back together into a single column.
5. Keep only the first ten rows of the data.

---
### Part 2: Please complete the same list of tasks above in Python.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>






## Part 1 Solution in R

To add a column labeled ```StoryType``` and fill it with ```Fake```: 

```{r}
FakeNews$StoryType <- "Fake"
```

To remove the ```when``` column:

```{r}
# Drop the column by name so the result does not depend on column order
FakeNews1 <- FakeNews[, names(FakeNews) != "when"]
```

To separate the URL column so you can see the website in one column and the domain in the other:

```{r}
library("tidyr")
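# extra = "merge" splits at the first underscore only and keeps the rest in Domain
FakeNews2 <- separate(FakeNews1, url, c("Website", "Domain"), sep = "_", extra = "merge")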
```

To put the columns you separated back together:

```{r}
FakeNews3 <- unite(FakeNews2, FullSiteName, Website, Domain, sep = "_")
```

To keep only the first ten rows of data: 

```{r}
FakeNews4 <- FakeNews3[1:10,]
```

---
## Part 2 Solution in Python

```python
import pandas as pd

# Update this path to wherever you saved FakeNews.xlsx
FakeNews = pd.read_excel('C:/Users/meredith.dodd/Documents/New Curriculum/104 L1/FakeNews.xlsx')
FakeNews.head()

# Add a column labeled StoryType and fill it with Fake

FakeNews['StoryType'] = "Fake"
FakeNews.head()

# Remove the when column

FakeNews.drop(['when'], axis=1, inplace=True)
FakeNews.head()

# Separate the url column at the first underscore into URL1 (website) and URL2 (domain)

FakeNews1 = FakeNews['url'].str.split('_', n=1, expand=True).rename(columns = lambda x: "URL"+str(x+1))
FakeNews1.head()

# Attach the two new columns to the original dataframe

FakeNews2 = pd.concat([FakeNews, FakeNews1], axis=1)
FakeNews2.head()

# Drop the original url column now that it has been split

FakeNews2.drop(['url'], axis=1, inplace=True)
FakeNews2.head()

# Put the separated columns back together

FakeNews2['url'] = FakeNews2["URL1"] + "_" + FakeNews2["URL2"].map(str)
FakeNews2.head()

# Drop the two pieces now that they have been recombined

FakeNews2.drop(['URL1', 'URL2'], axis=1, inplace=True)
FakeNews2.head()

# Keep only the first ten rows of the data

FakeNews3 = FakeNews2[:10]
FakeNews3
```
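
If you would like to check the steps above without the downloaded file, the following is a minimal sketch on a small in-memory dataframe. Only the `when` and `url` column names and the first URL come from the exercise; the `title` column, the second URL, and all other values are assumptions made up for illustration. It uses `drop(columns=...)`, which is equivalent to `drop(..., axis=1)`, and returns new dataframes rather than modifying in place.

```python
import pandas as pd

# Toy stand-in for the FakeNews data (values are assumptions, not real rows).
toy = pd.DataFrame({
    "when": ["2016-10-04", "2016-10-05"],
    "url": [
        "http://wayback.archive.org/web/20161004072420id_/http://alertchild.com/",
        "http://wayback.archive.org/web/20161005000000id_/http://example.com/",
    ],
    "title": ["Story A", "Story B"],
})
original_urls = toy["url"].tolist()

# 1. Add a StoryType column filled with "Fake".
toy["StoryType"] = "Fake"

# 2. Remove the when column.
toy = toy.drop(columns=["when"])

# 3. Split url at the first underscore into Website and Domain.
parts = toy["url"].str.split("_", n=1, expand=True)
parts.columns = ["Website", "Domain"]
split_up = pd.concat([toy.drop(columns=["url"]), parts], axis=1)

# 4. Put the two pieces back together, then drop them.
split_up["url"] = split_up["Website"] + "_" + split_up["Domain"]
rejoined = split_up.drop(columns=["Website", "Domain"])

# 5. Keep only the first ten rows (the toy data has just two).
first_ten = rejoined.head(10)
print(first_ten)
```

Comparing `rejoined["url"]` with `original_urls` is a quick way to confirm that the split and recombine steps round-trip without losing any part of the address.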

