# DS104 Data Wrangling and Visualization: Lesson Two Companion Notebook

### Table of Contents <a class="anchor" id="DS104L2_toc"></a>

* [Table of Contents](#DS104L2_toc)
    * [Page 1 - Introduction](#DS104L2_page_1)
    * [Page 2 - Data Transposition: See you on the Flip Side!](#DS104L2_page_2)
    * [Page 3 - Energy Practice Hands-On](#DS104L2_page_3)
    * [Page 4 - Energy Activity Solution R](#DS104L2_page_4)
    * [Page 5 - Transposing Data in Python](#DS104L2_page_5)
    * [Page 6 - Energy Activity Python](#DS104L2_page_6)
    * [Page 7 - Energy Activity Solution Python](#DS104L2_page_7)
    * [Page 8 - Transposing Data in Spreadsheets](#DS104L2_page_8)
    * [Page 9 - Combining Datasets Together](#DS104L2_page_9)
    * [Page 10 - Joining Datasets in R](#DS104L2_page_10)
    * [Page 11 - Appending in R](#DS104L2_page_11)
    * [Page 12 - Appending Activity in R](#DS104L2_page_12)
    * [Page 13 - Appending Activity Solution in R](#DS104L2_page_13)
    * [Page 14 - Merging Datasets in Python](#DS104L2_page_14)
    * [Page 15 - Appending in Python](#DS104L2_page_15)
    * [Page 16 - Combining in Python Activity](#DS104L2_page_16)
    * [Page 17 - Combining in Python Activity Solution](#DS104L2_page_17)
    * [Page 18 - Aggregating Data in R](#DS104L2_page_18)
    * [Page 19 - Aggregating Data in Python](#DS104L2_page_19)
    * [Page 20 - Pivot Tables](#DS104L2_page_20)
    * [Page 21 - Key Terms](#DS104L2_page_21)
    * [Page 22 - Data Transformation Hands-On](#DS104L2_page_22)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS104L2_page_1"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

from IPython.display import VimeoVideo
# Tutorial Video Name: Data Transformations
VimeoVideo('241240224', width=720, height=480)

# Introduction

Sometimes data are not in the configuration needed to do a certain analysis. Sometimes the data are in raw form, and they need to be summarized a bit. Sometimes the data are in rows, and need to be in columns. Sometimes the data are in more than one table, and they need to be merged somehow.

All of these problems will be addressed in this lesson, and fall under the general heading of *data transformation*. 

By the end of this lesson, you should be able to: 

* Understand the difference between long and wide data
* Transpose data
* Differentiate between the four types of joins and append
* Join and append data
* Aggregate data
* Create pivot tables in MS Excel

This lesson will culminate in a hands-on in which you transform data from Airbnb. 

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/430577617"> recorded live workshop </a> that goes over the Python material in this lesson. </p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Data Transposition: See you on the Flip Side!<a class="anchor" id="DS104L2_page_2"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Data Transposition: See you on the Flip Side! 

Transforming your data is all about the shape it takes. Data is typically called either “long” or “wide.” If you think back to your elementary school days, when your teacher talked about folding your paper “hotdog” style (long and skinny) versus “hamburger” style (short and fat), that gives you an idea of the data shapes in question.

![A long hamburger, a table with thirteen columns and four entries. The row headings are labeled year, occupation, real price, real national income. A wide hamburger, a table with four columns and thirteen-row entries. The column headings are year, consumption, real price, and real national income.](Media/104.L1.0.gif)

Typically, if you want to run any longitudinal analyses that look at change over time, you will need your data to be stored in wide format, but datasets in the real world often come in long format. So being able to flip back and forth between long and wide formats will be an often-used and much-needed skill. 

---

# Transposing Data in R

If you’ve ever worked with any other data analysis programs, you will not believe just how easy it is to transform data in R and Python. One simple command and your data has been flipped! 

Here you have **[a dataset looking at tea consumption in the United Kingdom from 1924 to 1936](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/tea.zip)**. If you wanted to examine how tea prices have changed over time, you wouldn’t be able to do it in the dataset’s native format, which is long.

![A table with five columns and thirteen-row entries. The column headings are labeled year, consumption, real price, and real national income. The row entries are as follows. Row 1, 1 1924, 395.5, 24.2, 3391. Row 2, 2, 1925, 398.7, 24.4, 3640. Row 3, 3, 1926, 403.1, 25.0, 3567. Row 4, 4, 1927, 410.0, 25.2, 3827. Row 5, 5, 1928, 417.4, 25.7, 3843. Row 6, 6, 1929, 430.0, 23.0, 3928. Row 7, 7, 11390, 440.0, 22.0, 4017. Row 8, 8, 1931, 456.0, 22.3, 3877.](Media/104.L1.1.png)

So, you need to use the function ```t()```. This is part of base R, so there is no package to load.

```{r}
tea1 <- t(tea)
```

Where ```tea1``` is your new dataset and ```tea``` is your old dataset.

And voila! Your data has been flipped as easy as that! The columns and rows have now switched places.  You can see that now you have your columns as the years and down the side for your rows you have the different variables that used to be columns.  This dataset is now ready for a longitudinal statistical analysis looking at the difference between years. 

![A table with thirteen columns and four rows. The row headings are labeled year, consumption, real price, real national income. The row entries are as follows. Row 1, 1924.0, 1925.0, 1926.0, 1927.0, 1928.0, 1929, 1930, 1931.0, 1932.0, 1933.0, 1934.0, 1935.0, and 1936.0. Row 2, 395.5, 398.7, 403.1, 410.0, 417.4, 430, 440, 456.0, 455.0, 435.3, 430.5, 441.5, 438.5. Row 3, 24.2, 24.4, 25.0, 25.2, 25.7, 23, 22, 22.3, 21.4, 22.6, 24.2, 24.2, 24.2, 24.8. Row 4, 3391.0, 3640.0, 3567.0, 3827.0, 3843.0, 3928, 4017, 3877.0, 3922.0, 4162.0, 4656.0, 4850.0.](Media/104.L1.3.png)

Now you’ll notice in the image above that there are no names on the columns.  This is because when you use ```t()```, it changes your data from a data frame into a matrix.  You can confirm this by using the ```class()``` function on ```tea1```: 

```{r}
class(tea1)
```

And the result is (in R 4.0.0 and later, the output also lists ```"array"```): 

```text
[1] "matrix"
```

It can easily be turned into a data frame using the function ```as.data.frame()```: 

```{r}
tea2 <- as.data.frame(tea1)
```

Where ```tea2``` is the name of your new dataset and ```tea1``` was the name of your old dataset in matrix format.

And the ```class()``` function will help you verify that things went as planned:

```{r}
class(tea2)
```

With the result of: 

```text
[1] "data.frame"
```

So when you look at the data, you now see column names. They’re not good ones, as they are the generic ones that R puts in, but they’re there.

![A table with thirteen columns and four rows. The row headings are labeled year, consumption, real price, real national income. The column headings are labeled V1 to V13. The row entries are as follows. Row 1, 1924.0, 1925.0, 1926.0, 1927.0, 1928.0, 1929, 1930, 1931.0, 1932.0, 1933.0, 1934.0, 1935.0, and 1936.0. Row 2, 395.5, 398.7, 403.1, 410.0, 417.4, 430, 440, 456.0, 455.0, 435.3, 430.5, 441.5, 438.5. Row 3, 24.2, 24.4, 25.0, 25.2, 25.7, 23, 22, 22.3, 21.4, 22.6, 24.2, 24.2, 24.2, 24.8. Row 4, 3391.0, 3640.0, 3567.0, 3827.0, 3843.0, 3928, 4017, 3877.0, 3922.0, 4162.0, 4656.0, 4850.0.](Media/104.L1.7.png)

If you wanted to rename them, you could use the ```gsub()``` function together with the ```names()``` function.  ```gsub()``` performs a find-and-replace on text, much like ```str.replace()``` in Python. You'll assign to ```names()``` on the dataset, filling it in with the results of the ```gsub()``` function.  The first argument for ```gsub()``` is the pattern to look for, which is the base of the current column names - in this case, ```V```.  The second is the new base you want to substitute, which is ```Year```.  The third is the text to search, which you get by calling the ```names()``` function again on your current dataset, like this:

```{r}
names(tea2) <- gsub("V", "Year", names(tea2))
```

This replaces all the Vs, which stood for variable, with Year, using the ```gsub``` command.  Here’s how it looks when you are done:  

![A table with twelve columns and four rows. The row headings are labeled year, consumption, real price, real national income. The column headings are labeled from year 1 to year 12. The row entries are as follows. Row 1, 1924.0, 1925.0, 1926.0, 1927.0, 1928.0, 1929, 1930, 1931.0, 1932.0, 1933.0, 1934.0, 1935.0. Row 2, 395.5, 398.7, 403.1, 410.0, 417.4, 430, 440, 456.0, 455.0, 435.3, 430.5, 441.5. Row 3, 24.2, 24.4, 25.0, 25.2, 25.7, 23, 22, 22.3, 21.4, 22.6, 24.2, 24.2, 24.2. Row 4, 3391.0, 3640.0, 3567.0, 3827.0, 3843.0, 3928, 4017, 3877.0, 3922.0, 4162.0, 4419.0.](Media/104.L1.9.png)

Don’t like your data? You can always flip it back again with ```t( )``` easily.

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Energy Practice Hands-On<a class="anchor" id="DS104L2_page_3"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In this activity, you will be completing the following requirements. This Activity will **not** be graded, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Please submit either screenshots of your code or your actual R script.  

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---
## Requirements

Here is a dataset on household energy consumption in the U.S: **[Energy](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/energy.zip)**

Give data transformations a shot on your own in R! 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Energy Activity Solution R<a class="anchor" id="DS104L2_page_4"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Activity Solution

To transpose this dataset in R: 

```{r}
energy1 <- t(energy)

energy2 <- as.data.frame(energy1)

names(energy2) <- gsub("V", "Year", names(energy2))
```


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Transposing Data in Python<a class="anchor" id="DS104L2_page_5"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Transposing Data in Python

You can use a similar process to transpose data in Python, using the ```.T``` attribute of a ```pandas``` DataFrame: 

```python
tea.T
```

The result has the same naming problem you saw in R: the new columns are labeled with the old row index rather than with the years. But Python has a slightly more elegant solution that you can actually call as you transpose. It is the ```set_index()``` method, and you will need to specify the name of the column that you want to pull the new column names from: 

```python
tea2 = tea.set_index('Year').T
```

Now, you have each column labeled with the appropriate year.  Pretty nifty, right? 

And if you don’t like it, or need to change it back, you can just call ```.T``` again, and things are back how they used to be.

![Snapshot of a window reads, in open square bracket 31 close square bracket tea 3 equals tea2.T. In open square bracket 32 close square bracket tea3. Out open square bracket 32 close square bracket. A table with four columns and thirteen row entries. The column headings are labeled year, consumption, real price, and real national income. The row entries are as follows. Row 1, 1924, 395.5, 24.2, 3391.0. Row 2, 1925, 398.7, 24.4, 3640.0, Row 3, 1926, 403.1, 25.0, 3567.0. Row 4, 1927, 410.0, 25.2, 3827.0. Row 5, 1928, 417.4, 25.7, 3843.0. Row 6, 1929, 430.0, 23.0, 3928.0. Row 7, 1930, 440.0, 22.0, 4017.0. Row 8, 1931, 456.0, 22.3, 3877.0. Row 9, 1932, 455.0, 21.4, 3922.0. Row 10, 1933, 435.3, 22.6, 4162.0. Row 11, 1936, 441.5, 24.2, 4656.0.](Media/104.L1.15.png)
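To see the round trip end to end, here is a minimal sketch on a three-year sample. The ```Year``` column name comes from the lesson; the other column names and the cut-down size are assumptions made for illustration. One detail worth noticing is that after ```set_index()```, the years live in the index, so ```reset_index()``` is used here to turn them back into an ordinary column.

```python
import pandas as pd

# Hypothetical three-year sample; only the 'Year' column name is taken from the lesson
tea = pd.DataFrame({
    "Year": [1924, 1925, 1926],
    "Consumption": [395.5, 398.7, 403.1],
    "RealPrice": [24.2, 24.4, 25.0],
})

# Move Year into the index first, so the years become the column labels after .T
tea_wide = tea.set_index("Year").T

# Transpose again, then turn the Year index back into an ordinary column
tea_back = tea_wide.T.reset_index()

print(tea_wide)
print(tea_back)
```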

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Energy Activity Python<a class="anchor" id="DS104L2_page_6"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


This Activity will **not** be graded, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Please submit your Python Notebook.  

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---
## Requirements

Here is a dataset on household energy consumption in the U.S: **[Energy](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/energy.zip)**

Give data transformations a shot on your own in Python! 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Energy Activity Solution Python<a class="anchor" id="DS104L2_page_7"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Activity Solution

To transpose this dataset in Python: 

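```python
import pandas as pd

# Adjust this path to wherever you saved energy.xlsx
energy = pd.read_excel('C:/Users/meredith.dodd/Documents/New Curriculum/104 L1/energy.xlsx')
energy.head()

# Plain transpose: the new columns are labeled with the old row index
energy.T

# Use the Year column for the new column labels instead
energy1 = energy.set_index('Year').T
energy1.head()
```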


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Transposing Data in Spreadsheets<a class="anchor" id="DS104L2_page_8"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Transposing Data in Spreadsheets

The transpose operation is common in any data manipulation tool that stores data in a tabular format. 

---

## Transposition in MS Excel

For MS Excel, all you need to do is copy the table portion that needs to be transposed. Then, when you want to paste it, rather than hitting ```Ctrl + V```, right click and select the transpose icon. Here's what that looks like:

![A table with thirteen columns and four rows. The row headings are labeled year, consumption, real price, real national income. The row entries are as follows. Row 1, 1924.0, 1925.0, 1926.0, 1927.0, 1928.0, 1929, 1930, 1931.0, 1932.0, 1933.0, 1934.0, 1935.0, and 1936.0. Row 2, 395.5, 398.7, 403.1, 410.0, 417.4, 430, 440, 456.0, 455.0, 435.3, 430.5, 441.5, 438.5. Row 3, 24.2, 24.4, 25.0, 25.2, 25.7, 23, 22, 22.3, 21.4, 22.6, 24.2, 24.2, 24.2, 24.8. Row 4, 3391.0, 3640.0, 3567.0, 3827.0, 3843.0, 3928, 4017, 3877.0, 3922.0, 4162.0, 4656.0, 4850.0.](Media/104.L1.64.png)

It is as simple as that!

---

## Transposition in Google Sheets

In Google Sheets, the approach is nearly identical, but there is no icon to use. After copying the portion to be transposed, simply click on the cell that is to be the upper left hand corner cell of the transposed data, right click, then click on ```paste special```, and finally ```paste transposed```.

---



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Combining Datasets Together<a class="anchor" id="DS104L2_page_9"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Combining Datasets Together

Sooner rather than later in your data science career, there will come a time when you need data from more than one dataset. Thus, two (or more) datasets will need to become one! 

---

## Types of Joins

You’ve already touched on how to merge data in SQL using the joining features.  SQL is probably the best program to do data joins and similar manipulations, but it never hurts to learn how to do so in multiple languages to give yourself options.

You will now review the types of *joins*, also known as *merges*, nicely depicted with this graphic:  

![Four Venn diagrams labeled left, inner, right, and full outer. The figure has a caption joins. The Venn diagram labeled left has a shaded portion on the first full circle. The second Venn diagram is shaded only in the intersecting portion. The third Venn diagram has a shaded portion on the second full circle. The last Venn diagram is completely shaded.](Media/104.L1.16.gif)

* **Inner Join**: Gives you only the records that have a match in both tables.  
* **Left Join**: Keeps every record from the first dataset (the left one), plus the matching data from the second.
* **Right Join**: Keeps every record from the second dataset (the right one), plus the matching data from the first.  
* **Full Outer Join**: Keeps every record from both datasets, whether or not it has a match.
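
The four join types can be sketched on two tiny tables. The tables below are made up for illustration (the names ```left```, ```right```, and ```id``` are assumptions, not part of the lesson data): ids 1 and 2 appear in both tables, 3 appears only on the left, and 4 only on the right.

```python
import pandas as pd

# Hypothetical tables: ids 1 and 2 are shared, 3 is left-only, 4 is right-only
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
right = pd.DataFrame({"id": [1, 2, 4], "score": [9.5, 8.0, 7.5]})

# Record which ids survive each type of join
ids_by_join = {}
for how in ["inner", "left", "right", "outer"]:
    joined = pd.merge(left, right, on="id", how=how)
    ids_by_join[how] = sorted(joined["id"].tolist())
    print(how, ids_by_join[how])
```

Rows that have no partner in the other table are kept by the left, right, and outer joins, with missing values filling the columns that come from the other side.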

---

## Append

You can also *append*.  This is when you are basically stacking one dataset on top of another, or laying one dataset next to another.  You aren't connecting them by a key, so if your data is out of order at all, you will end up with a very confused and messed up dataset. 

When appending data to the end, you really only want to do this when you have all the same fields (data columns) and are only adding cases (rows).  

When appending columns to the side of your data, you will only want to do this when your data is all in the same order. 
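
As a rough illustration of both directions, here is a sketch using ```pd.concat()``` on invented data (all table and column names below are hypothetical). The second half shows the ordering risk: the side-by-side version pairs rows by their index labels, not by any key, so a table stored in a different order gets attached to the wrong athletes.

```python
import pandas as pd

# Hypothetical data: two batches with identical columns
batch1 = pd.DataFrame({"athlete": ["A", "B"], "mass": [60.5, 61.9]})
batch2 = pd.DataFrame({"athlete": ["C", "D"], "mass": [66.7, 76.2]})

# Appending rows: same fields, more cases
stacked = pd.concat([batch1, batch2], ignore_index=True)

# Appending columns: rows are paired by index label (here the default 0, 1),
# not by a key column, and this table happens to list athlete B first
heat_other_order = pd.DataFrame({"athlete_check": ["B", "A"], "heat": [186, 347]})
side_by_side = pd.concat([batch1, heat_other_order], axis=1)

print(stacked)
print(side_by_side)  # the first row pairs athlete A with B's heat value
```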

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Joining Datasets in R<a class="anchor" id="DS104L2_page_10"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Joining Datasets in R

Joining, or merging, will add variables to your dataset, as long as there is at least one variable in common between the two datasets.  

Here you have data about the 2018 Olympic Figure Skating Competition, spread over two datasets. One has information about the **[performances](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/performances.zip)**, and one has information about the **[aspect judging of the performances](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/judgesAspectsUnique.zip)**.  These data sets have the variable name ```performance_id``` in common, which acts as a unique key for the data.  

In the ```performances``` dataset, you have the following columns: 

![A table has nine columns with ten-row entries. The column headings are labeled performance id, competition, program, name, nation, rank, starting number, total segment score, and total element score. The last column total element score is left blank. The column program is labeled ice dance - free dance for all the entries.](Media/104.L1.18.png)

And in the other dataset, named ```judgesAspectsUnique```, you have the following columns:

![A table has twelve columns labeled aspect id, performance id, section, aspect num, aspect desc, info flag, credit flag, base value, factor, goe, ref, and scores of panel.](Media/104.L1.19.png)

To combine them, you call the ```merge()``` function, then list in the parentheses as arguments the two datasets you want to merge together, then add an additional argument of ```by=``` to create a vector of the variable or variables that you want to merge by.  This should be a variable that is unique and occurs in both datasets. 

```{r}
IceSkating1 <- merge(performances, judgesAspectsUnique, by=c("performance_id"))
```

Now, you can see, by calling the ```str()``` function, that you have the fields from both datasets in your new dataset, ```IceSkating1```, with fields from the first dataset appearing first, and fields from the second data set showing after that.  

![A command screen displays the result for the command str open bracket IceSkating close bracket. The result displays the details of name, nation, rank, starting number, total segment score, total element score, total component score, total deductions, aspect id, section, aspect num, aspect desc, info flag, credit flag, base value, factor, goe, ref, and scores of panel.](Media/104.L1.20.png)

---

## Merge as an Outer Join

You should be aware that this merge doesn’t include any data for non-matching ```performance_id``` data; meaning that it is an inner join. So, if an id shows up in your first dataset, but not your second, it wouldn’t be included, and vice versa.  The way to solve this problem is to add ```all=TRUE``` as an argument to your merge statement, like this: 

```{r}
IceSkating2 <- merge(performances, judgesAspectsUnique, by=c("performance_id"), all=TRUE)
```

![Snapshot of the lines 30, 31, 32, 33 of coding. Line 30 reads, hashtag symbol, making sure you get cases from both datasets. Line 31 left blank. Line 32 reads IceSkating 2 merge open bracket performances, judgesAspectsUnique, by equals c open bracket performance id close bracket, all equals true close bracket. Line 33 left blank.](Media/104.L1.23.png)

Now in this case, you didn’t increase the number of rows, because there just happened to be a match for all ids.  But you would typically expect an output with ```all=TRUE``` to have more rows, since you’d be able to capture more data.  When you’re not sure whether you have each unique id in both data files, better safe than sorry: just add ```all=TRUE``` to get a full outer join instead of an inner join, and no harm done.

---

## Merge with Different ID Column Names

You can also perform a merge when the columns you want to merge by aren’t named the same thing.  Previously, in each figure skating data set, your unique id to merge by was ```performance_id```.  But what if the first dataset called this column ```performance_id``` and the second dataset had the same information in a column called ```id_performance```?  Not to fear, R has the solution! Simply name each column in the ```by.x``` and ```by.y``` arguments of merge.  The ```by.x``` is for your initial dataset, and the ```by.y``` is for your second dataset.  

```{r}
IceSkating2 <- merge(performances, judgesAspectsUnique, by.x=c("performance_id"), by.y=c("id_performance"))
```

---

## What Happens When Columns are Named the Same, but Contain Different Data?

Lastly, what should you do when your datasets have the same variable names, but represent different things? Typically, the best way to go to be sure you get everything right and decrease confusion is to just rename them.  But there may be times when you forge ahead accidentally.  Rather than this being an “oh shoot” moment, R helps you out in your time of trouble by adding into the variable name which dataset the variable came from, as you can see here.  You ended up with two columns named ```scores_of_panel```, one in each dataset, and so R automatically added on ```.x``` or ```.y``` to inform you from which dataset that column originated.  In this case, ```x``` will always be the dataset you listed first in the merge, and ```y``` will always be the second dataset you listed in the merge. 

![A table with eleven columns labeled scores of panel x, total deductions, aspect id, section, aspect num, aspect desc, info flag, credit flag, base value, factor, goe, ref, and scores of panel y. There are nineteen-row entries. The section columns are labeled elements and components. All the entries for the column info flag labeled N A.](Media/104.L1.25.png)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Appending in R<a class="anchor" id="DS104L2_page_11"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Appending

So now you have merged data by a unique field in R. What if you just wanted to add cases (or dataset rows), and didn’t need to add any variables at all? 

The second way to combine datasets is to append them.  The function is ```rbind()``` in R, which will just add cases (rows) to your dataset.  This requires the datasets to have exactly the same variables, with the same names and, ideally, the same variable types; if the columns do not line up, R will stop with an error.

For this example, here is some data on athlete performance, with the same four variables: ```BodyMass```, ```WorkLevel```, ```HeatOutput```, and ```Country```.  The **[first dataset has only athletes from Britain](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/muscles1.zip)**, while the **[second dataset has only athletes from Algeria](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/muscles2.zip)**.  

When you use ```rbind()```, this will add the second dataset to the first, once you list both datasets you want to combine:

```{r}
muscles3 <- rbind(muscles1, muscles2)
```

And remember, the columns must be identical.  If they aren’t, you’ll get a friendly reminder by way of error message: 

![A snapshot of two lines of source code. The first line reads, error in rbind open bracket deparse.level, … close bracket. The second line reads, number of columns of arguments do not match.](Media/104.L1.21.png)

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - Appending Activity in R<a class="anchor" id="DS104L2_page_12"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

This Activity will **not** be graded, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Please submit either screenshots of your code or your actual R script.  

---
## Requirements

For this data exercise, you will be putting your data combination knowledge into practice! Below you’ll find links to two different data sets about the Zika virus outbreak.  The first, “ZikaColombia” has information about the outbreak in Colombia, while the second, “ZikaUS” has information about Zika in the U.S.  It’s your job to combine them into one dataset in R, so that these two countries can be analyzed together.    

**[ZikaColombia](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/ZikaColombia.zip)**

**[ZikaUS](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/ZikaUS.zip)**

1)  Should you use the merge or append function? 

-  Merge
-  **Append**

2) Perform your chosen function on the dataset.

---

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 13 - Appending Activity Solution in R<a class="anchor" id="DS104L2_page_13"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Activity Solution

1. Use the append function.

2. Here is how to append these datasets in R:

```{r}
Zika <- rbind(ZikaColombia, ZikaUS)
```

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 14 - Merging Datasets in Python<a class="anchor" id="DS104L2_page_14"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Merging Datasets in Python

As with R, there are many different ways to combine data sets in Python.  The first is a simple merge.  After you have imported the ```pandas``` package as ```pd```, all you need to do is call the ```pd.merge()``` function and specify your datasets. Python will do the rest! By default, the merge uses every column name the two datasets share as the key, so you don’t have to specify a common variable (R's ```merge()``` behaves the same way when you leave out ```by=```): 

```python
IceSkating = pd.merge(judgesAspectsUnique, performances)
```

---

## Specifying the Common Variable

If you did feel the urge to specify the common variable, however, that is easily done with the ```on=``` argument:

```python
IceSkating = pd.merge(judgesAspectsUnique, performances, on='performance_id')
```
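
As a quick check that the two calls agree, here is a sketch on miniature stand-ins for the skating tables. The rows and the extra columns are invented; only the table names and ```performance_id``` come from the lesson. Because ```performance_id``` is the only column name the two frames share, the default merge and the explicit ```on=``` merge should produce the same result.

```python
import pandas as pd

# Hypothetical miniature versions of the two skating tables
performances = pd.DataFrame({
    "performance_id": ["p1", "p2"],
    "name": ["Skater A", "Skater B"],
})
judgesAspectsUnique = pd.DataFrame({
    "performance_id": ["p1", "p1", "p2"],
    "aspect_id": ["a1", "a2", "a3"],
})

# With no on=, pandas joins on every column name the two frames share
default_merge = pd.merge(judgesAspectsUnique, performances)

# Naming the key explicitly documents the intent and guards against surprises
explicit_merge = pd.merge(judgesAspectsUnique, performances, on="performance_id")

print(default_merge.equals(explicit_merge))
```

Spelling out ```on=``` becomes more important when the tables share other column names by accident, since the default would then try to match on all of them.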

And if the key columns are named differently in the two datasets – no need to sweat it! Python also is set up to use common fields with different names, by specifying ```left_on=``` and ```right_on=```: 

```python
IceSkating = pd.merge(judgesAspectsUnique, performances, left_on='performance_id', right_on='id_performance')
```
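
One side effect worth knowing: when the key columns have different names, the merged result keeps both of them. A minimal sketch, using invented two-row tables (```judges```, ```perfs```, and their values are assumptions for illustration), shows the duplicate key and one way to drop it.

```python
import pandas as pd

# Hypothetical tables whose key columns carry different names
judges = pd.DataFrame({"performance_id": ["p1", "p2"], "goe": [1.5, -0.5]})
perfs = pd.DataFrame({"id_performance": ["p1", "p2"], "nation": ["CAN", "FRA"]})

merged = pd.merge(judges, perfs, left_on="performance_id", right_on="id_performance")
print(list(merged.columns))  # both key columns are present

# The two key columns hold the same values, so one of them can be dropped
merged = merged.drop(columns="id_performance")
print(list(merged.columns))
```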

---

## What Happens When Columns are Named the Same, but Contain Different Data?

Lastly, just like in R, Python can also handle overlapping column names that represent different things.  Although it is wise to try to catch these beforehand and rename as necessary, if you do overlook columns named the same thing in different datasets, Python will automatically add ```_x``` or ```_y``` to the column names to show from which dataset the data was derived.  As demonstrated below, when you have two columns named ```scores_of_panel```, Python will give you a hand: 

![A table has ten columns with two-row entries. The column headings are labeled scores of panel x, competition, program, name, nation, rank, starting number, total segment score, total element score, and scores of panel y.](Media/104.L1.30.png)
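
If ```_x``` and ```_y``` are too cryptic, ```pd.merge()``` also accepts a ```suffixes=``` argument so you can choose your own labels. The sketch below uses invented one-row tables, and the suffix names are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical tables that both contain a column named scores_of_panel
judges = pd.DataFrame({"performance_id": ["p1"], "scores_of_panel": [2.0]})
perfs = pd.DataFrame({"performance_id": ["p1"], "scores_of_panel": [75.3]})

# Default behavior: pandas appends _x (left table) and _y (right table)
default_names = pd.merge(judges, perfs, on="performance_id")

# Custom suffixes make the origin of each column easier to read
custom_names = pd.merge(judges, perfs, on="performance_id",
                        suffixes=("_judges", "_perf"))

print(list(default_names.columns))
print(list(custom_names.columns))
```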

---

## Specifying the Type of Join

Python lets you name the type of join you want directly, where R's ```merge()``` relies on the ```all=```, ```all.x=```, and ```all.y=``` arguments.  If you wanted to do an inner join, which would only select what’s in common between data sets, use the ```how=``` argument, and specify ```'inner'``` (this is also the default). For outer joins, which will select everything, and mark as missing (NaN) anything that doesn’t overlap, you’ll specify ```'outer'```. You can also specify ```how='left'``` or ```how='right'```.  Here is what all four of those would look like: 

```python
IceSkating = pd.merge(judgesAspectsUnique, performances, how='inner')
IceSkating = pd.merge(judgesAspectsUnique, performances, how='outer')
IceSkating = pd.merge(judgesAspectsUnique, performances, how='left')
IceSkating = pd.merge(judgesAspectsUnique, performances, how='right')
```
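
When you are unsure which ids are missing from which table, an outer join combined with the ```indicator=True``` argument of ```pd.merge()``` can help. It adds a ```_merge``` column that says whether each row came from the left table only, the right table only, or both. The tables below are invented for illustration.

```python
import pandas as pd

# Hypothetical tables: p1 is missing from judges, p4 is missing from performances
performances = pd.DataFrame({"performance_id": ["p1", "p2", "p3"], "rank": [1, 2, 3]})
judges = pd.DataFrame({"performance_id": ["p2", "p3", "p4"], "goe": [0.5, 1.0, -1.0]})

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
checked = pd.merge(performances, judges, on="performance_id",
                   how="outer", indicator=True)
print(checked[["performance_id", "_merge"]])

# List the ids that did not find a partner in the other table
unmatched = checked.loc[checked["_merge"] != "both", "performance_id"].tolist()
print(unmatched)
```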

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 15 - Appending in Python<a class="anchor" id="DS104L2_page_15"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Appending

The way to append data in Python is by using the concatenate function in ```pandas```: ```pd.concat()```.  All that *concatenation* means is to stick things together, so this function will stick your data together, one on top of the other.  Here’s an example with the same muscles dataset you used earlier with R: 

```python
Muscles3 = pd.concat([Muscles1, Muscles2])
```

As you can see, when this command is run, it stacks with your first dataset on top (Country is Britain) and your second dataset on bottom (Country is Algeria).  You’ll note that this just puts one on top of the other – it doesn’t even change the indexing.  

![A table has five columns and 13 rows. The row entries are as follows. Row 1, 15, 60.5, 56.0, 347, Britain. Row 2, 16, 61.9, 13.0, 186, Britain. Row 3, 17, 61.9, 19.0, 216, Britain. Row 4, 1, 61.9, 34.5, 265, Britain. Row 5, 19, 61.9, 43.0, 306, Britain. Row 6, 20, 61.9, 56.0, 348, Britain. Row 7, 21, 66.7, 13.0, 209, Britain. Row 8, 22, 66.7, 43.0, 324, Britain. Row 9, 23, 66.7, 56.0, 352, Britain. Row 9, 0, 76.2, 156.8, 3398, Algeria. Row 10, 1, 71.3, 114.1, 2988, Algeria. Row 11, 2, 69.6, 142.6, 3048, Algeria. Row 12, 3, 58.0, 142.6, 2781, Algeria.](Media/104.L1.32.png)

You can get notified that the indices overlap if you add in the argument ```verify_integrity=True``` to your ```pd.concat()``` function:

```python
Muscles4 = pd.concat([Muscles1, Muscles2], verify_integrity=True)
```

The result, when you have overlapping indices as in the example above, is an error that looks like this: 

![A snapshot of a line of source code that reads, value error indexes have overlapping values int64index open bracket open square bracket 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 close square bracket, dtype int64 close bracket.](Media/104.L1.34.png)
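If you would rather handle the overlap than have your script stop, one option is to catch the ```ValueError```. The sketch below assumes two small made-up datasets (```first``` and ```second```) in place of the muscles data:

```python
import pandas as pd

# Hypothetical stand-ins for Muscles1 and Muscles2; both use index 0 and 1
first = pd.DataFrame({"Country": ["Britain", "Britain"]})
second = pd.DataFrame({"Country": ["Algeria", "Algeria"]})

try:
    pd.concat([first, second], verify_integrity=True)
    overlap_found = False
except ValueError as err:
    # pandas raises ValueError when index labels repeat across the inputs
    overlap_found = True
    print("Could not append:", err)
```

Whether to catch the error or let it stop the run is a judgment call; catching it is mainly useful when you want to report the problem and continue with other work.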

Note that it will specify which indexes overlap in the error message, if you need to target them specifically for some reason.  But in this case, the easiest way to fix this is to just ignore the index, using the argument ```ignore_index=True```, which will basically create a new one.  Here’s how that works: 

```python
Muscles5 = pd.concat([Muscles1, Muscles2], ignore_index=True)
```

Now you can see that the index no longer restarts at zero where the second dataset begins. Instead, the appended rows continue the numbering, starting at 24.

![A table has five columns and 10 rows. The row entries are as follows. Row 1, 20, 61.9, 56.0, 348, Britain. Row 2, 21, 66.7, 13.0, 209, Britain. Row 3, 22, 66.7, 43.0, 324, Britain. Row 4, 23, 66.7, 56.0, 352, Britain. Row 5, 24, 76.2, 156.8, 3398, Algeria. Row 6, 25, 71.3, 114.1, 2988, Algeria. Row 7, 26, 69.6, 142.6, 3048, Algeria. Row 8, 27, 58.0, 142.6, 2781, Algeria. Row 9, 28, 74.6, 142.6, 2912, Algeria. Row 10, 30, 68.9, 128.3, 3135, Algeria.](Media/104.L1.35.png)
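The same renumbering can be confirmed on a small example. As before, ```first``` and ```second``` are made-up stand-ins rather than the real muscles data:

```python
import pandas as pd

# Hypothetical stand-ins for Muscles1 and Muscles2
first = pd.DataFrame({"Country": ["Britain", "Britain"]})
second = pd.DataFrame({"Country": ["Algeria", "Algeria"]})

renumbered = pd.concat([first, second], ignore_index=True)

# The original labels are discarded and a fresh 0..n-1 index is built
print(list(renumbered.index))
```

With two rows in each dataset, the new index runs from 0 to 3 with no repeats.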

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 16 - Combining in Python Activity<a class="anchor" id="DS104L2_page_16"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```c-lms
activity-type: project
activity-name: Combining in Python
points: 0
due-at: 14%
close-at: end-of-module
```

This Activity will **not** be graded, but we encourage you to complete it. The best way to become a great data scientist is to practice! Please submit either screenshots of your code or your actual Python file.  

---
## Requirements

For this data exercise, you will be putting your data combination knowledge into practice! Below you’ll find links to two different data sets about spy planes.  One has the plane specifications (“PlaneFeatures”), and the other has some registration information about the potential spy plane candidates (“PlaneCandidates”).  It’s your job to combine them together, so that we can see all the data together in the same frame.     

**[PlaneFeatures](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/PlaneFeatures.zip)**

**[PlaneCandidates](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/PlaneCandidates.zip)**

1)  Should you use the merge or append function? 

-  **Merge**
-  Append

2) Perform your chosen function on the dataset.


<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 17 - Combining in Python Activity Solution<a class="anchor" id="DS104L2_page_17"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

---
# Activity Solution

1. Use the merge function.

2. Here's how to merge these two datasets in Python: 

```python
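import pandas as pd

# Load the plane specifications (adjust the path to wherever you saved the file)
PlaneFeatures = pd.read_excel('C:/Users/meredith.dodd/Documents/New Curriculum/104 L1/PlaneFeatures.xlsx')
PlaneFeatures.head()

# Load the registration information for the candidate planes
PlaneCandidates = pd.read_excel('C:/Users/meredith.dodd/Documents/New Curriculum/104 L1/PlaneCandidates.xlsx')
PlaneCandidates.head()

# Join the two datasets on their shared key, adshex
Planes = pd.merge(PlaneFeatures, PlaneCandidates, on='adshex')
Planes.head()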
```
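If you are not sure which column to join on, one approach is to look for column names that appear in both datasets. The sketch below uses miniature made-up tables; only the ```adshex``` key comes from the activity, while the ```speed``` and ```owner``` columns and their values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two plane tables
features = pd.DataFrame({"adshex": ["A1", "B2", "C3"], "speed": [120, 95, 140]})
candidates = pd.DataFrame({"adshex": ["A1", "C3"], "owner": ["X", "Y"]})

# Column names present in both tables are candidate join keys
shared = sorted(set(features.columns) & set(candidates.columns))
print(shared)

# Merge on the shared key; the default join type is inner
planes = pd.merge(features, candidates, on="adshex")
print(planes.shape)
```

Because the default join is an inner join, the plane with key ```B2``` drops out here: it has no match in the second table.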

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 18 - Aggregating Data in R<a class="anchor" id="DS104L2_page_18"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Aggregating Data

Another excellent thing to know when data wrangling is how to aggregate your data.  This allows you to not only group your data by a particular variable, but perform operations on it such as sum or average.  This is often known as *grouping* as well.

---

## Aggregating Data in R

Here is **[some data that provides cases per state](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/states.zip)** - but the states are not unique; each state is listed in multiple rows. 

Here’s how to total the cases for each state in R, using the function ```sum```:

```{r}
states2 <- aggregate(Cases~State, states, sum)
```

As always, to be safe, you want to save a new data set in case something weird happens or you want to go back to the original, so you’ll type in the new data set name before the arrow, then use the function ```aggregate()```.  The variable that goes before the ```~``` (pronounced tilde) is the variable that you want to perform the operation on, and the variable that comes after is what you want to group the first variable by.  Then you specify the name of the old data set, and the operation you want to use. 

The command above takes your data from this, where each state may appear in several rows.

![A table has two columns and sixteen-row entries. The column headings are state and cases. The row entries are as follows. Row 1, Michigan, 5. Row 2, Michigan, 7. Row 3, Michigan, 6. Row 4, Michigan, 4. Row 5, Michigan, 6. Row 6, Michigan, 2. Row 7, Michigan, 4. Row 8, Michigan, 9. Row 9, Vermont, 2. Row 10, Vermont, 2. Row 11, Vermont, 6. Row 12, Vermont, 2. Row 13,Vermont, 7. Row 14, West Virginia, 3. Row 15, West Virginia, 7. Row 16, West Virginia, 15.](Media/104.L1.37.png)

To this, where each row is now a unique state, and you have summed the number of cases by state. 

![A table has two columns and five-row entries. The column headings are state and cases. The row entries are as follows. Row 1, Georgia, 49. Row 2, Michigan, 43. Row 3, Tennessee, 38. Row 4, Vermont, 254. Row 5, West Virginia, 88.](Media/104.L1.38.png)

---

## Operations for Aggregate

Aggregating is a handy trick for when employers want a snapshot of data quickly.  And the best part is, you can do it with several different functions, to cover most of your basic descriptive statistics: 

```{r}
states2 <- aggregate(Cases~State, states, sum)
states3 <- aggregate(Cases~State, states, mean)
states4 <- aggregate(Cases~State, states, median)
states5 <- aggregate(Cases~State, states, min)
states6 <- aggregate(Cases~State, states, max)
```

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 19 - Aggregating Data in Python<a class="anchor" id="DS104L2_page_19"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Aggregating Data in Python

You can aggregate data very similarly in Python in ```pandas``` by using the ```groupby()``` function, as shown below, where ```states``` is the dataset name and the variable in ```( )``` is your grouping variable, while the variable in ```[ ]``` is what you want to perform the operation on. For instance:

```python
states.groupby('State')['Cases'].sum()
```

And just like in R, you can use a variety of different operations with the ```groupby()``` function: 

* ```.sum()``` (shown above)
* ```.mean()```
* ```.median()```
* ```.max()```
* ```.min()```
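If you want several of these statistics at once, ```pandas``` also lets you pass a list of operation names to ```.agg()```. Here is a minimal sketch; ```states_sample``` is a small made-up dataset shaped like the states data, not the real file.

```python
import pandas as pd

# Hypothetical sample with the same two columns as the states data
states_sample = pd.DataFrame({
    "State": ["Michigan", "Michigan", "Vermont", "Vermont", "Vermont"],
    "Cases": [5, 7, 2, 2, 6],
})

# One row per state, one column per summary statistic
summary = states_sample.groupby("State")["Cases"].agg(
    ["sum", "mean", "median", "min", "max"]
)
print(summary)
```

The result is a table with one row for each state and one column for each statistic, which is often easier to read than running each operation separately.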

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 20 - Pivot Tables<a class="anchor" id="DS104L2_page_20"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Pivot Tables

A pivot table is one of the most useful tools in MS Excel. Briefly, a pivot table takes a large dataset and summarizes it into small, manageable tables, and it can do so in a variety of ways. Pivot tables also have a lot of complexity and capability. 

Once you have created a pivot table, you can provide it to clients, who can then easily examine the relationships between many different variables at the click of a button.  If you have somewhat data-savvy clients who really want a deeper dive, making use of a pivot table can save you some time.  It will allow them to play with the data themselves rather than continually sending you piddly group-by requests.

Here is [the first helpful video](https://www.youtube.com/watch?v=qu-AK0Hv0b4) on pivot tables; here's [another one](https://www.youtube.com/watch?v=9NUjHBNWe9M).
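The videos cover pivot tables in Excel. If you would like to try the same kind of summary in Python, ```pandas``` has a ```pd.pivot_table()``` function. The sketch below is only an illustration: the ```sales``` data and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical data: units sold by region and product
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "units": [10, 4, 7, 3],
})

# Rows are regions, columns are products, cells hold the summed units
table = pd.pivot_table(
    sales, index="region", columns="product", values="units", aggfunc="sum"
)
print(table)
```

Unlike an Excel pivot table, this result is static; clients cannot click through it, so it suits quick summaries more than interactive exploration.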

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 21 - Key Terms<a class="anchor" id="DS104L2_page_21"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms 

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Wide data</td>
        <td>Cases are shown as different columns.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Long data</td>
        <td>Cases are shown as different rows.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Transposing</td>
        <td>Changing data from long to wide or vice versa.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Transformation</td>
        <td>Changing the shape of the data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Merge/Join</td>
        <td>Combining datasets together.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Inner Join</td>
        <td>Provides everything that is a match in both datasets.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Left Join</td>
        <td>Provides every record from the first dataset, plus any matching records from the second.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Right Join</td>
        <td>Provides every record from the second dataset, plus any matching records from the first.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Full Outer Join</td>
        <td>Provides every record from both datasets, matched where possible.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Append</td>
        <td>Stacking data side by side or on top of one another without a matching key.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Pivot Table</td>
        <td>A way to summarize data in MS Excel and allow clients to interactively examine their data.</td>
    </tr>
</table>

---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>t()</td>
        <td>Transposes data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>class()</td>
        <td>Determines the format of the data placed in the parentheses.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>as.data.frame()</td>
        <td>Changes other data formats into data frames.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>gsub()</td>
        <td>Replaces every match of a pattern in text with a substitute.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>merge()</td>
        <td>A function that joins data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>by=</td>
        <td>An argument to merge() that specifies the key by which to join.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>all=TRUE</td>
        <td>An argument to merge() that performs a full outer join, returning every record regardless of whether it matches in the other dataset.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>by.x= and by.y=</td>
        <td>Arguments to merge() that allow you to have matching keys with different variable names.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>rbind()</td>
        <td>Allows you to add rows to your dataset.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>aggregate()</td>
        <td>A function that allows you to group like data together and look at a summary statistic like sum or mean. Statistics allowed as arguments include: sum, mean, median, min, and max.</td>
    </tr>
</table>

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.T</td>
        <td>Transposes your data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pd.merge()</td>
        <td>A function in the pandas package that joins data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>on=</td>
        <td>An argument in pd.merge() that allows you to specify the variable upon which to join.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>left_on= and right_on= </td>
        <td>Arguments to pd.merge() that allow you to specify a different name for the key in each dataset.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>how=</td>
        <td>An argument to pd.merge() that allows you to specify the type of join.  Choose inner, outer, left, or right as values. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pd.concat()</td>
        <td>A function in the pandas package that allows you to append data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>verify_integrity=True</td>
        <td>An argument to pd.concat() that raises an error if the datasets have overlapping indices.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ignore_index=True</td>
        <td>An argument to pd.concat() that allows you to ignore the index of the data and creates a new index.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>groupby()</td>
        <td>A function that allows you to aggregate data and utilize a summary statistic.  Statistics options available as functions include: .sum(), .mean(), .median(), .max(), and .min().</td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 22 - Data Transformation Hands-On<a class="anchor" id="DS104L2_page_22"></a>

[Back to Top](#DS104L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Data Transformation Hands-On

In this hands-on, you will be using R, Python, or a combination of both to analyze data from the Airbnb website. This Hands-On will be graded.  The best way to become a data scientist is to practice!

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

You are working for Airbnb, and they are trying to improve their website.  They've collected **[data by unique id on gender, signup method, language, affiliates, devices, and browsers of website visitors](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/airbnb_test_users.zip)**.  

They'd like to know the following: 

* What is the average ```age``` of those who use each web browser type?
* What is the total ```signup_flow``` for each device? 

Make sure you use your newfound data aggregation skills to find the answer.

They would also like you to perform the following tasks: 

* They need the ```country``` information from  **[this dataset](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/airbnb_sample_submission.zip)** included in the ```airbnb_test_users``` file. 
* Add additional users to the test file from **[this dataset](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/airbnb_users.zip)**. 

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Make sure both datasets have the same columns before you append them.</p>
    </div>
</div>

Please annotate your code to explain each step and answer each question, then attach your .ipynb and/or R script file, so your work can be graded. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>