# DS104 Data Wrangling and Visualization : Lesson Three Companion Notebook

### Table of Contents <a class="anchor" id="DS104L3_toc"></a>

* [Table of Contents](#DS104L3_toc)
    * [Page 1 - Introduction](#DS104L3_page_1)
    * [Page 2 - What is Recoding? ](#DS104L3_page_2)
    * [Page 3 - Dummy Coding - A Special Kind of Recode](#DS104L3_page_3)
    * [Page 4 - Recoding into a New Variable in Python](#DS104L3_page_4)
    * [Page 5 - Grouping with a Recode in Python](#DS104L3_page_5)
    * [Page 6 - Recoding from Continuous to Categorical in Python](#DS104L3_page_6)
    * [Page 7 - Recoding into the Same Variable in Python](#DS104L3_page_7)
    * [Page 8 - Dummy Coding in Python](#DS104L3_page_8)
    * [Page 9 - Recoding into a New Variable in R](#DS104L3_page_9)
    * [Page 10 - Recoding into the Same Variable in R](#DS104L3_page_10)
    * [Page 11 - Missing Data](#DS104L3_page_11)
    * [Page 12 - Key Terms](#DS104L3_page_12)
    * [Page 13 - Recoding Hands-On](#DS104L3_page_13)
    * [Page 14 - Recoding Hands-On Solution](#DS104L3_page_14)

    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS104L3_page_1"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction
VimeoVideo('388859063', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO104L03overview.zip)**.

# Introduction

Even if you have all the right columns and rows, and your data's in the right shape, things still might not run or make sense.  That's where *recoding* comes in - you are able to take the data you have and code it so that it represents different information or becomes a different data type.  You may take your data from continuous to categorical, or from string to numeric.  

By the end of this lesson, you should be able to: 

* Understand best practices in recoding
* Comprehend dummy coding and when it is utilized
* Recode into a new variable
* Recode into the same variable
* Group data by recoding
* Dummy code data

This lesson will culminate with a hands-on exercise in which you will recode data on worldwide eating habits in both Python and R. 

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/443250851"> recorded live workshop </a> that goes over the material in this lesson using Python or <a href="https://vimeo.com/449951217"> this recorded live workshop </a> that goes over the material in R. </p>
    </div>
</div>


In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction
VimeoVideo('443250851', width=720, height=480)

In [3]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction
VimeoVideo('449951217', width=720, height=480)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - What is Recoding? <a class="anchor" id="DS104L3_page_2"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# What is Recoding? 

Another large part of data wrangling is *recoding* data.  Recoding is when you change the information in a variable in some way, so that it is represented differently.  For instance, some machine learning algorithms in Python won't allow you to have any string (character) data.  But, you may still want to use this information for the learning algorithm - it just needs to change format.  How do you do that? Recoding! You can put numbers as stand-ins for character values.  

Take an example variable that has information about dogs' fur colors.  Values in this variable might be brown, cream, and black.  Having those written out as strings just won't fly! But you can put numeric stand-ins for these values.  Usually, you will want to start with 0 or 1, just to make things easier, rather than starting with any random number.  So, make brown be 0, cream be 1, and black be 2.  Now you have recoded them, retaining their meaning, but they are now fit for machine learning.
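A minimal sketch of that fur color recode in pandas is below. The ```dogs``` data frame and its ```FurColor``` column are made-up names for illustration and are not part of the lesson's datasets.

```python
import pandas as pd

# Hypothetical example data; the column name FurColor is an assumption.
dogs = pd.DataFrame({"FurColor": ["brown", "cream", "black", "brown"]})

# Numeric stand-ins for each string value, starting at 0.
fur_codes = {"brown": 0, "cream": 1, "black": 2}

# Recode into a new column so the original strings are kept alongside the numbers.
dogs["FurColorR"] = dogs["FurColor"].map(fur_codes)

print(dogs)
```

Keeping the lookup in one dictionary also documents which number stands for which color.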

---

## Recoding to Analyze Data Differently

There may also be times when you'd like to examine your data in different ways, but didn't collect data in a specific way.  For instance, you could analyze the feedback data that Woz-U is collecting from everyone at the end of their modules to determine where the largest changes need to be made.  That data automatically comes in tagged by module (i.e. DSO101, DSO102, etc.), but you would like to look at it by the three major programs: Cyber Security, Data Science, and Full Stack Web Development.  So, you could recode the data so that all ten modules for Data Science get tagged with a 1 to stand for Data Science, all ten Cyber Security modules get tagged with a 2 for Cyber Security, and all ten Full Stack modules get tagged with a 3 for Full Stack Web Development. Now it is easy to compare the differences between programs, without the information about the modules cluttering things up.

---

## Recoding into a New Variable vs. Rewriting your Data

When recoding data, you will always have the option of whether to recode into a new variable, thus retaining the original column of data, or whether to replace the data.  Each has a place.  If you are trying to recode for an analysis like machine learning algorithms in Python that will not accept ANY string data, then saving a copy of your dataset and then recoding into the original columns is probably the best way to go.  But if you are running statistics in R, which will only analyze the columns you ask it to, then for safety's sake and ease of use, you will probably want to recode into a new variable.  

---

## Naming Conventions for Recoding

Especially if you are recoding into a new variable, it is important to have some sort of typical naming convention to indicate a recoded variable that you and your colleagues will always use.  This will help differentiate between the old variable and the new one and will eliminate confusion and mix-ups, since some analyses will still run on data that has not been recoded, but will produce incorrect results! A good standard to use is to place a capital ```R``` at the end of the variable name, standing for recode, so anyone who's looking at your data knows this is recoded. If it is something that went from multiple categories to a "yes" or "no" or a zero and one placeholder, then you can also place ```YN``` at the end, to ensure there isn't any confusion with other types of recodes.  However, it does not matter what naming convention you use, so long as you are consistent and clear, and you communicate your conventions to your colleagues. 

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>When you only have two categories, like "Yes" and "No", this is called a dichotomous variable, from the root "di" meaning two! </p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Dummy Coding - A Special Kind of Recode<a class="anchor" id="DS104L3_page_3"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Dummy Coding - A Special Kind of Recode

When you get to analyses like regression or analyses of covariance, you will find that they require a special kind of recoding called *Dummy Coding* in order to use categorical data.  When you put categorical variables into these types of analyses, they can only have two levels - either something happened / is present or it did not happen / is not present.  Usually when something is present, you will recode it as a 1, and when something is not present, you will recode it as a zero.  That's all well and good, but what happens if you have a categorical variable that has more than two levels? You need to create additional variables, all of which will have two levels only.

Take an example like race.  There are many categorical levels to race.  For instance, if you have the following: 

* Caucasian
* African American
* Asian
* Native American
* Pacific Islander

There are five levels there, and for regression, you can't have more than two! So, you will create five new variables, each recoded as a zero or as a one.  They would look like this: 

* Caucasian : 0 or 1
* African American : 0 or 1
* Asian : 0 or 1
* Native American : 0 or 1
* Pacific Islander : 0 or 1

You can then continue along your merry way in your regression analyses, and include all five variables in your model to capture race.  
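As a rough sketch of what those five 0/1 variables look like in pandas, the example below uses a small made-up ```people``` data frame (an assumption, not a lesson dataset). It also shows the ```drop_first=True``` option of ```pd.get_dummies```: some regression routines expect one level to be left out as a reference category, because the five columns always add up to 1 for every row.

```python
import pandas as pd

# Hypothetical sample; the values mirror the five levels listed above.
people = pd.DataFrame({"Race": ["Caucasian", "African American", "Asian",
                                "Native American", "Pacific Islander", "Asian"]})

# One 0/1 column per level.
race_dummies = pd.get_dummies(people["Race"], dtype=int)

# Same idea, but leaving out the first level as the reference category.
race_dummies_ref = pd.get_dummies(people["Race"], drop_first=True, dtype=int)

print(race_dummies)
print(race_dummies_ref.columns.tolist())
```

Whether to keep all five columns or drop one depends on the analysis you plan to run, so check what your modeling function expects.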

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>Dummy Coding is sometimes called base coding as well.</p>
    </div>
</div>

---
<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>

If you want to dummy code more than one variable or all categorical variables at one time, you can! The syntax is as follows:

<b>Multiple Categorical</b>
```python
variablesdf = pd.get_dummies(df, columns=['column1','column2','column3'])
```

<b>All Categorical Variables</b>

```python
df = pd.get_dummies(df)
```
</div>
    <div class="panel-body">
        <p>This function saves you a step by dropping the original string columns for you!</p>
    </div>
</div>
 

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Recoding into a New Variable in Python<a class="anchor" id="DS104L3_page_4"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Recoding into a New Variable in Python

Though there are many ways to recode a new variable, if you are going to be recoding multiple columns in similar ways, then the best way is to write and define a function that you can apply to several different variables easily. An example of when this type of recode might come in handy is if you asked multiple different survey questions and they all had the same response options of ```Strongly Disagree```, ```Disagree```, ```Neutral```, ```Agree```, and ```Strongly Agree```.  If you had 20 questions that all needed to be recoded, then it would be a pain to recode each one of those separately.  

You'll use a **[dataset](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/superheroes.zip)** with information about superheroes. You can create a function named ```gender``` and define it as such: 

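```python
def gender(series):
    # apply() passes in one value at a time, so series is a single cell here
    if series == "Male":
        return 0
    if series == "Female":
        return 1
    # any other value returns None and shows up as missing in the new column
```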

The above is just a set of ```if``` statements, where if your data has the value ```Male``` then it will become a ```0```. 

Once that has been done, you can apply the recode function, ```gender()``` that you just created, using the ```apply()``` function.  Putting a new column name in ```[]``` before the equals sign means that the result of your applied function is stored in the ```superheroes1``` dataframe in a new column named ```GenderR```.  

```python
superheroes1['GenderR'] = superheroes1['Gender'].apply(gender)
```

The result is a new column called ```GenderR``` at the end of your data frame that has 0s for males and 1s for females.

![A table has thirteen columns and ten-row entries. The column headings are labeled unnamed 0, name, gender, eye color, race, hair color, height, publisher, skin color, alignment, weight, and gender.](Media/Recode1.png)

You could apply the function ```gender()``` over and over again if you had more than one column that needed these particular values recoded.  That is an unlikely scenario with something like gender, but may be very common with something like survey responses.
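The survey scenario mentioned earlier can be sketched as follows. The ```survey``` data frame, the question names ```Q1``` and ```Q2```, and the 1 to 5 coding are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical survey data; column names Q1 and Q2 are assumptions.
survey = pd.DataFrame({
    "Q1": ["Agree", "Neutral", "Strongly Disagree"],
    "Q2": ["Strongly Agree", "Disagree", "Agree"],
})

def likert(value):
    """Recode one response into 1 (Strongly Disagree) to 5 (Strongly Agree)."""
    if value == "Strongly Disagree":
        return 1
    if value == "Disagree":
        return 2
    if value == "Neutral":
        return 3
    if value == "Agree":
        return 4
    if value == "Strongly Agree":
        return 5

# Apply the same function to every question, writing each into a new R column.
for question in ["Q1", "Q2"]:
    survey[question + "R"] = survey[question].apply(likert)

print(survey)
```

With 20 questions, only the list in the ```for``` loop would need to grow; the function itself is written once.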

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Grouping with a Recode in Python<a class="anchor" id="DS104L3_page_5"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Grouping with a Recode in Python

Another thing that is best to do with a recode into a new variable is grouping.  For instance, you'll notice that the ```Publisher``` column in the ```superheroes1``` dataset shows names of producers of comic books, movies, TV series, etc.  What if you really wanted to focus on the type of media, rather than the name? If you wanted to know whether you had a comic, a show or movie, or a book? Well, you could assign the appropriate values to the publisher column, so that you have effectively created the three categories of: 

* Comic
* Screen
* Book 

Take a look:

```python
def media (series): 
    if series == "Marvel Comics" : 
        return "Comic"
    if series == "DC Comics": 
        return "Comic"
    if series == "NBC - Heroes" : 
        return "Screen"
    if series == "Dark Horse Comics" : 
        return "Comic"
    if series == "George Lucas" : 
        return "Screen"
    if series == "Image Comics" : 
        return "Comic"
    if series == "HarperCollins" : 
        return "Book"
    if series == "Star Trek" : 
        return "Screen"
    if series == "SyFy" : 
        return "Screen"
    if series == "Team Epic TV" : 
        return "Screen"
    if series == "IDW Publishing" : 
        return "Book"
    if series == "ABC Studios" : 
        return "Screen"
    if series == "Shueisha" : 
        return "Unknown"
    if series == "Icon Comics": 
        return "Comic"
    if series == "Wildstorm" : 
        return "Unknown"
    if series == "Sony Pictures" : 
        return "Screen"
    if series == "South Park" : 
        return "Screen"
    if series == "J.R.R. Tolkein" : 
        return "Book"
    if series == "Universal Studios" : 
        return "Screen"
    if series == "Rebellion" : 
        return "Unknown"
    if series == "Titan Books" : 
        return "Book"
    if series == "Hanna-Barbera" : 
        return "Screen"
    if series == "Microsoft" : 
        return "Unknown"
    if series == "J.K. Rowling" : 
        return "Book"

superheroes1['Media'] = superheroes1['Publisher'].apply(media)
```

And now you have a new variable, showing different information than publisher, that you could do useful things with, like graph frequencies.  By the way, graphing the frequencies of a group is also a pretty good way to check to make sure you didn't miss anything in a recode. 

![A snapshot of a window displays two lines of source code and a bar chart. The source code reads In open square bracket 55 close square bracket 1 superheroes1 open square bracket media.value underscore counts open and close brackets.plot open bracket bar close bracket. Out open square bracket 55 close square bracket open angular bracket matplotlib.axes. underscore subplots.AxesSubplot at 0x261f828fc88. A bar chart represents five parts labeled comic, screen, book, and unknown on the x-axis and the numbers range from 0 to 600 on the y-axis. The comic bar is the highest and the unknown bar is the lowest.](Media/Recode6.png)

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>To make sure you're including all possible values, make use of the .value_counts() function, which will list out all the values and their frequencies. </p>
    </div>
</div>
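As an alternative sketch of the same grouping, the chain of ```if``` statements can be swapped for a dictionary and ```.map()```. This is a different technique from the function above, shown on a made-up handful of publisher values (including one, ```Some New Publisher```, that is deliberately missing from the lookup) to illustrate how a frequency count exposes values that were missed.

```python
import pandas as pd

# Hypothetical slice of the Publisher column, with one value missing from the lookup.
publishers = pd.Series(["Marvel Comics", "DC Comics", "George Lucas",
                        "HarperCollins", "Some New Publisher"])

# Lookup table: original value -> media type (a subset of the recode above).
media_lookup = {
    "Marvel Comics": "Comic",
    "DC Comics": "Comic",
    "George Lucas": "Screen",
    "HarperCollins": "Book",
}

media = publishers.map(media_lookup)

# dropna=False keeps the NaN row in the count, exposing values the lookup missed.
counts = media.value_counts(dropna=False)
print(counts)
```

Any value that does not appear in the dictionary comes back as missing, so a nonzero ```NaN``` count is a prompt to revisit the lookup.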

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Recoding from Continuous to Categorical in Python<a class="anchor" id="DS104L3_page_6"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Recoding from Continuous to Categorical in Python

Sometimes, you'll want to take some continuous data and artificially make it categorical. While this does have the potential to bias your data a little bit, because you're not using it in its original intended format, it can be a useful way to examine things in more detail, and you will find that the layperson, like your employer, may often ask you to break your continuous data down into groups.  Therefore, there is value to learning how!

You can use the same kind of recode format as you did to recode things into a new variable.  However, you'll use comparison operators such as greater than, less than, greater than or equal to, and less than or equal to in order to denote your groups.

---

## An Example

For instance, say you have an interest in determining how many of the superheroes are over five feet tall, and thus want to create a new categorical variable that is "Yes, over five feet," or "no, not over five feet." You can do this by recoding the continuous variable ```Height``` into categories: 

```python
def height_recode (series): 
    if series < 152.4:
        return 0
    if series >= 152.4: 
        return 1

superheroes1['Height_5ftYN'] = superheroes1['Height'].apply(height_recode)
```

Since ```Height``` is in centimeters, you'll need to first find out how many centimeters five feet is: 152.4.  Then you can do one ```if``` statement for anything less than that amount, and another for anything greater than or equal to that amount.   Note that it is important that these categories don't overlap in any way, so you wouldn't want to say less than or equal when you are using greater than or equal.

The resulting new variable now groups each superhero by whether they are under five feet (0) or at least five feet (1).
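For comparison, here is a sketch of the same split using ```pd.cut```, a pandas binning function that is a different technique from the ```if``` statements above. The heights are made-up values in centimeters.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in centimeters.
heights = pd.Series([150.0, 152.4, 188.0, 66.0])

# right=False makes each bin include its left edge and exclude its right edge,
# so 152.4 falls in the second bin, matching the >= in height_recode.
over_5ft = pd.cut(heights, bins=[-np.inf, 152.4, np.inf], labels=[0, 1], right=False)

print(over_5ft.tolist())
```

Because the bins share a single edge, the two groups cannot overlap, which is the same rule the ```if``` statements have to follow by hand.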

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Recoding into the Same Variable in Python<a class="anchor" id="DS104L3_page_7"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Recoding into the Same Variable in Python

If you want to recode into the same variable, then you can do this using the ```.replace()``` function.  Simply create a dictionary for all the variables and values you want to recode using a ```key:value``` pair format, and then replace everything all at once. 

There are some things to be aware of when using this method, however: 

* You can only run the cell that uses ```.replace()``` once.  So, if you make any mistakes, you will need to go back to a previous version of your dataset, then correct the recoding dictionary, and then use ```.replace()``` again.  This can be a pain in the butt. Double check everything, including commas, colons, and brackets, before running.

* This does **permanently replace** all your data.  So you don't have options to use another variable or go back very easily. If you want to change the level at which you're looking at data, recoding into a different variable first is strongly recommended.

* It can be a little difficult to read if you have a lot of values for each variable.

So, in the coding below, you can see that the dictionary is called ```cleanup```, and you are recoding 7 different variables all at once: ```Gender, EyeColor, Race, HairColor, Publisher, SkinColor, and Alignment```.  Curly braces surround the entire dictionary, and then you will place the variable name in quotes followed by a colon.  After the colon, you'll use a second set of curly braces to define the value pairs.  First, place the original value in quotes, followed by a colon and then the new value.  If this new value is a string (name), then it will need to be in quotes, but if you are recoding to a number like this example, then no quotes are necessary. You will separate each value pair with a comma for a particular variable, and end with a curly brace.  To add more variables, simply add a comma after the curly brace and repeat! 

It's probably a lot easier to show you: 

![Source code for the process of cleanup is generated for the category gender. The codes are set for eye color, race, hair color, publisher, skin color, and alignment.](Media/Recode2.png)

See how all your variable (column) names pretty much line up, and that after the colon, you start defining a list of old to new values? It's easy to see with something that has only a few values like ```Gender``` or ```Alignment```, but once you get into variables like ```Race``` or ```HairColor```, there are a lot of options and you'd have to scroll a lot to see everything.
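In text form, the dictionary structure looks roughly like the sketch below. The ```heroes``` data frame is a made-up two-column stand-in for the superheroes data, and the numeric codes are assumptions rather than the ones in the screenshot.

```python
import pandas as pd

# Hypothetical two-column stand-in for the superheroes data.
heroes = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Alignment": ["good", "bad", "neutral"],
})

# Outer keys are column names; each inner dictionary maps old value -> new value.
# The numeric codes here are assumptions, not the ones in the screenshot.
cleanup = {
    "Gender": {"Male": 0, "Female": 1},
    "Alignment": {"good": 0, "bad": 1, "neutral": 2},
}

# Assigning the result to a new name leaves the original frame untouched,
# so the dictionary can be corrected and applied again if needed.
heroes_recoded = heroes.replace(cleanup)

print(heroes_recoded)
```

Storing the recoded result under a different name is one way to keep a previous version of the data around in case the dictionary has a mistake in it.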

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>To make sure you're including all possible values, make use of the .value_counts() function, which will list out all the values and their frequencies. </p>
    </div>
</div>

Once you run that cell, feel free to examine your data again.  Make sure that everything worked as expected. It is highly likely you'll find a boo-boo in there somewhere.  In the recode you just did, something is deliberately amiss, so you can learn from it.  Look here: 

![A window with source code that reads in open square bracket 34 close square bracket 1 superheroes1.head open and close brackets. Out open square bracket 34 close square bracket. A table has twelve columns labeled unnamed 0, name, gender, eye color, race, hair color, height, publisher, skin color, alignment, weight, and gender.](Media/Recode3.png)

Just looking at the head of your data, you should spot something: Although everything should be numeric for the columns ```Gender, EyeColor, HairColor, Publisher, SkinColor, and Alignment```, in the Publisher column, ```Marvel Comics``` sticks out like a sore thumb.  If you look back at the recode image, you'll see that you recoded ```Marvel``` and not ```Marvel Comics```. 

So how can you fix this error? Well, you will need to re-run an earlier cell, either to import your data or to generate a previous version of the data set.  In your case, you can go back up to where you created ```superheroes1``` and then fix the cleanup to read ```Marvel Comics``` and then look at the data once more. 

![Source code for the process of cleanup is generated for the category gender. The source code reads, In open square bracket 39 close square bracket. The codes are set for eye color, race, hair color, publisher, skin color, and alignment. In open square bracket 40 close bracket 1 superheroes1.head open and close brackets. A table has twelve columns labeled unnamed 0, name, gender, eye color, race, hair color, height, publisher, skin color, alignment, weight, and gender.](Media/Recode4.png)

And voila! Fixed.  But definitely with some extra steps necessary. If you try and run this again on the same data, you'll end up with this error message on the bottom: 

![Source code reads, TypeError: cannot compare types ndarray open bracket dtype equals int 64 close bracket and str.](Media/Recode5.png)

So if you do make a mistake and try to re-run, hopefully you'll remember you need to go back to a previous data version before attempting to correct the error.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Dummy Coding in Python<a class="anchor" id="DS104L3_page_8"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Dummy Coding in Python

```pandas``` has an awesome function for dummy coding in Python, called ```pd.get_dummies```. You'll feed it the dataset name and variable name for the column you want to dummy code, and it will create a data frame with the columns for that variable all set up as zeroes and ones.  Here's the code: 

```python
AlignmentDummy = pd.get_dummies(superheroes1['Alignment'])
```

Which will produce this data frame: 

![Source code reads, In open square bracket 57 close bracket 1 alignmentdummy. Out open square bracket 57 close square bracket. A table has four columns labeled -, bad, good, and neutral with first five and last five rows. Also optional line to recode - to Unknown](Media/Recode7.png)

If you would like to change the column name ```-``` to something more descriptive, the optional line shown in the image above renames that column to ```Unknown```.
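One way to do that rename is sketched below with ```DataFrame.rename```. The exact line in the screenshot may differ, and the ```alignment``` values here are made up for illustration.

```python
import pandas as pd

# Hypothetical Alignment values, including the "-" placeholder.
alignment = pd.Series(["good", "bad", "-", "neutral"], name="Alignment")

alignment_dummy = pd.get_dummies(alignment, dtype=int)

# Rename only the "-" column; the other columns keep their names.
alignment_dummy = alignment_dummy.rename(columns={"-": "Unknown"})

print(alignment_dummy.columns.tolist())
```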

Having this information is nice, but it would be much more useful if you added it back into your original ```superheroes1``` data frame.  Use your newfound concatenation skills with ```pd.concat```, and set the ```axis=``` argument to ```1``` so that it adds them as columns, not rows.  

```python
superheroes2 = pd.concat([superheroes1, AlignmentDummy],axis=1)
```

Now you can see that you have four new columns in your data frame: ```Unknown, bad, neutral, and good```, and they are each populated with zeros and ones.

![A table has seventeen columns and five-row entries. The column headings are labeled name, gender, eye color, race, hair color, height, publisher, skin color, alignment, weight, gender R, media, Unknown, bad, good, and neutral.](Media/Recode8.png)

Now you are ready to handle analyses like regression and machine learning functions that require dummy coding!

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Recoding into a New Variable in R<a class="anchor" id="DS104L3_page_9"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Recoding into a New Variable in R

In order to recode into a new variable in R, first, you will need to create a new, but empty variable.  Assigning values as ```NA``` is how you'll do that: 

```{r}
superheroes1$GenderR <- NA
```

Then, you'll fill that new column in! Specify the name of the dataset and new variable first, then in square brackets, place the old variable name and the old variable values.  After the arrow, you will place the new variable value.

```{r}
superheroes1$GenderR[superheroes1$Gender=='Male'] <- 0
superheroes1$GenderR[superheroes1$Gender=='Female'] <- 1
```

Run it and ta da! There's that new variable!

![A table has ten columns labeled gender, eye color, race, hair color, height, publisher, skin color, alignment, weight, and gender R. The row entries are for eight males.](Media/Recode9.png)

You can use this same method to group things in a recode, like you did in the previous Python example.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Recoding into the Same Variable in R<a class="anchor" id="DS104L3_page_10"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Recoding into the Same Variable in R

There are far fewer times in R when it will be absolutely necessary to recode into the same variable.  To be on the safe side, for R, it's recommended to recode into a new variable.   But if you really want to, or come up against a situation where you need to recode into the same variable, you can just do something similar to the above, but don't create a new variable: 

```{r}
superheroes1$Gender[superheroes1$Gender=='Male'] <- 0 
superheroes1$Gender[superheroes1$Gender=='Female'] <- 1 
```

Beware! Sometimes you'll run into variable type conflicts when you try and recode into the same variable in R.   This warning looks like this: 

![A warning message reads, In open square bracket factor open bracket asterisk tmp asterisk, superheroes dollar symbol Gender equals equals Male, value equals c open bracket NA, : invalid factor level, NA generated.](Media/Recode10.png)

This is telling you that the variable ```Gender``` was previously a string or a factor, and so when you recode it, R can't make it numeric, so it turns everything to missing! Uh oh! This is definitely a warning you will want to heed, as you've gone from having ```Gender``` populated with information to having the ```Gender``` column be completely blank, showing only ```NA```! 

![A table has ten columns with eight-row entries. The column headings are labeled X, name, gender, eye color, race, hair color, height, publisher, skin color, and alignment.](Media/Recode11.png)

If you haven't previously saved a copy of your dataset you can go back to, you'll need to re-import the whole thing. Yet another reason why you should always rename your datasets as you do new things - so you have an easier way to revert!

---

## Addressing the Warning Message

This particular warning message problem can be fixed in an easy way - just uncheck the ```Strings as factors``` checkbox when you import the CSV in R. By default R checks this, but it can often get in the way.

![A window labeled import dataset on the title bar. The left panel has a name field, a dropdown list box labeled encoding, two radio buttons labeled yes and no for heading field, five dropdown list boxes labeled row names, separator, decimal, quote, and comment. A checkbox labeled strings as factors is placed at the bottom of the panel. Next, the screen has top and bottom panels labeled input file and data frame. Two buttons labeled import and cancel are placed at the bottom of the window.](Media/Recode12.png)

Once that's done, you can now recode into the same variable just fine. Now ```gender``` is just as you coded it to be.

![A table has ten columns with eight-row entries. The column headings are labeled X, name, gender, eye color, race, hair color, height, publisher, skin color, and alignment.](Media/Recode13.png)

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Want to see additional recode examples in R? Want to tackle all your recodes in R at once? </h3>
    </div>
    <div class="panel-body">
        <p>Then check out this<b><a href="https://www.theanalysisfactor.com/r-tutorial-recoding-values/"> website.</a></b></p>
    </div>
</div>

---

## Dummy Coding in R

The ```dummy_cols()``` function in the ```fastDummies``` package, shown below, dummy codes a column and adds the new columns to the original data frame automatically. Nifty!

```{r}
install.packages("fastDummies")
library("fastDummies")
superheroes <- dummy_cols(superheroes, select_columns = "Alignment")
``` 

---

## Summary 

Recoding is an important part of data wrangling, as it will prepare your data for analyses such as machine learning and regression.  There are many different types of recoding, and each has its own uses.  Recoding can also help you to extract additional meaning from your data by categorizing it in new ways.

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Missing Data<a class="anchor" id="DS104L3_page_11"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Missing Data

Missing data in any dataset is almost inevitable. You will never have every single piece of data for every single person or item. Therefore, part of data wrangling and cleaning/screening of your data is to choose how to deal with your missing data, and then execute that approach. The way you deal with missing data is a very important consideration; you have the potential to bias your results if you choose the wrong method for dealing with missing data. Listed below are several approaches to missing data:

* Pairwise deletion of missing data
* Listwise deletion of missing data
* Mean imputation
* Hot deck imputation

This Python example shows you how to see how many missing values you have in each column. The cells below show how to replace ```NaN``` values with a string or impute missing values with the mean: 

![Show Missing Values](Media/missing_dataDSO104.png)
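Since the code above is only available as an image, here is a rough text sketch of counting missing values and replacing them with a string. The ```animals``` data frame and its column names are assumptions made up for illustration, and the screenshot's exact code may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; column names are assumptions.
animals = pd.DataFrame({
    "Species": ["cat", np.nan, "dog", np.nan],
    "Age": [3.0, 5.0, np.nan, 7.0],
})

# Number of missing values in each column.
missing_per_column = animals.isna().sum()
print(missing_per_column)

# Replace missing strings with a placeholder label in a new column.
animals["SpeciesR"] = animals["Species"].fillna("Unknown")
print(animals)
```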

This is how you achieve the same as above in R:

![Impute Missing Values](Media/missing_data2DSO104.png)

Without further ado, let's take a look at the different approaches to missing data listed above.

---

## Pairwise Deletion of Missing Data

The concept of *pairwise deletion* is for when you only want to remove data from a particular analysis when it is missing the particular variables you are using.  For instance, if you are running an analysis with age, gender, and testing scores, but you have 10 other variables in your dataset, as you do that analysis, you will only remove subjects that have data missing on either age, gender, or testing scores.  If they have other data missing, you don't care, and run your particular analysis anyway.  

Especially in R, there are several analyses that are already set up to do pairwise deletion automatically for you, so you don't need to worry about doing anything manually. 

Using pairwise deletion causes you to have something called a *valid N*.  It means that you no longer have one sample size for an entire project, but rather may have a different *N* for every single analysis you do.  This can be confusing, and depending on your audience, you may need to report upon the valid *N* for all your work. 
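
To make the idea of a valid *N* concrete, here is a small sketch in Python. It assumes a made-up data frame with `age`, `gender`, `score`, and `species` columns, and handles pairwise deletion manually by dropping rows only when they are missing a variable used in that particular analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data: gaps are scattered across different columns
df = pd.DataFrame({
    "age": [25, 31, np.nan, 40, 52],
    "gender": ["F", "M", "F", "F", "M"],
    "score": [88.0, 92.0, 75.0, 81.0, np.nan],
    "species": [None, None, "cat", "dog", None],
})

# Analysis 1 uses age and score, so only gaps in those two columns matter
analysis_1 = df.dropna(subset=["age", "score"])

# Analysis 2 uses gender and score, so a missing age is ignored here
analysis_2 = df.dropna(subset=["gender", "score"])

valid_n_1 = len(analysis_1)
valid_n_2 = len(analysis_2)
print(valid_n_1, valid_n_2)
```

The two analyses end up with different sample sizes (3 and 4) even though they come from the same five rows, and the gaps in `species` never cost you a row.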

---

## Listwise Deletion of Missing Data

*Listwise deletion* is when you delete an entire row when it has missing data anywhere in it.  Although this is easy, and gives you one sample size for the entire dataset, it can be problematic: if your data has a lot of missing values, and you have a lot of columns / variables, you can end up with very few valid rows left.  If you have a dataset with 13 column variables and 20 rows, and 13 of those rows are each missing a different one of the 13 variables, you would end up with only 7 rows left.  

You may also end up throwing good data out with the bad. If you're missing data from a column on species, but have everything you need to conduct the analyses you want to run on age, gender, and testing scores, it would be silly to reduce your sample size! So think carefully about whether listwise deletion is really necessary.
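
Here is a rough sketch of how quickly listwise deletion can shrink a dataset. It builds a made-up data frame with 20 rows and 13 columns (the `var1` to `var13` names are placeholders), blanks out a single value in each of the first 13 rows, and then drops every incomplete row with `.dropna()`.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 20, 13
columns = [f"var{i + 1}" for i in range(n_cols)]

# Hypothetical dataset that starts out complete
df = pd.DataFrame(np.ones((n_rows, n_cols)), columns=columns)

# Blank out one value in each of the first 13 rows, a different column per row
for i in range(n_cols):
    df.iloc[i, i] = np.nan

total_missing = int(df.isnull().sum().sum())

# Listwise deletion: drop any row that has a missing value anywhere
complete_rows = df.dropna()
print(total_missing, len(complete_rows))
```

Only 13 of the 260 cells are missing, yet listwise deletion leaves just 7 of the 20 rows.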

---

## Mean Imputation

Deletion isn't the only way to handle missing data. You can also replace your missing data with other values, which is generically called *imputation*.  One of the things you can replace your missing data with is the mean of that variable.  This is thus called *mean imputation*.  While mean imputation can be useful, it can also bias your results, because every filled-in value sits exactly at the average.  The variable's mean stays the same, but its spread shrinks and its relationships with other variables can be distorted.  This means that you may be reducing your chances of finding something significant, or improving your chances of finding something significant, depending on the analysis you run. 
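
The sketch below illustrates that trade-off with a handful of made-up scores: filling the gaps with the mean leaves the average unchanged but makes the data look less spread out than it really is.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing values
scores = pd.Series([60.0, 70.0, np.nan, 90.0, np.nan, 100.0])

# Mean imputation: fill each gap with the mean of the observed values
imputed = scores.fillna(scores.mean())

mean_before = scores.mean()   # pandas skips NaN by default
mean_after = imputed.mean()
std_before = scores.std()
std_after = imputed.std()

print(mean_before, mean_after)
print(std_before, std_after)
```

The mean is 80 both before and after, while the standard deviation drops, because the two filled-in values add no variation of their own.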

---

## Hot Deck Imputation

In order to deal with some of the biases that mean imputation imparts, the process of *hot deck imputation* was developed.  In hot deck imputation, you are still filling in data for missing values, but you are filling them in with a little more finesse.  Hot deck imputation tries to fill in values conditionally, based upon important variables.  So, if you see that most women in their 40s have a certain average body temperature, for instance, you would use that average to fill in missing data about body temperature for a woman who is 43, rather than using the generic average. 

Hot deck imputation is still just making data up, but it is a slightly more educated guess than mean imputation.  It can still bias your data, and it can still be incorrect.
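
As a rough sketch of the conditional fill described above, the example below fills each missing body temperature with the average for people of the same gender and age group, and compares that with the generic average. The data and column names are made up for illustration. Note that many hot deck procedures copy an observed value from a similar "donor" row instead of using a group average, so treat this as one simple variant of the idea rather than a standard implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing temperature in each group
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "age_group": ["40s", "40s", "40s", "20s", "20s", "20s"],
    "body_temp": [36.4, 36.6, np.nan, 37.0, np.nan, 37.2],
})

# Average temperature within each gender / age group combination
group_mean = df.groupby(["gender", "age_group"])["body_temp"].transform("mean")

# Conditional fill: use the average of similar people
df["temp_conditional"] = df["body_temp"].fillna(group_mean)

# Generic fill, for comparison: use the overall average
df["temp_overall"] = df["body_temp"].fillna(df["body_temp"].mean())

print(df)
```

The woman in her 40s gets roughly 36.5 from her own group instead of the overall average of roughly 36.8.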

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - Key Terms<a class="anchor" id="DS104L3_page_12"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Recoding into a New Variable</td>
        <td>Creating a new variable and adding in values based on another column.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Recoding into the Same Variable</td>
        <td>Changing the values in one column into different values in the same column.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Dummy Coding</td>
        <td>Taking a categorical variable with more than one level and creating an additional column for each level, filled with zeros and ones.</td>
    </tr>
</table>

---

## Key Python Commands and their Usage

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.apply()</td>
        <td>Applies a recode to a particular variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.value_counts()</td>
        <td>Use this command to get a list of all possible values for a particular variable, which will aid in recoding.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.replace()</td>
        <td>Recode into the same variable by replacing its values using a dictionary of old-value / new-value pairs.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pd.get_dummies()</td>
        <td>Creates dummy coding columns, which can be attached to your data frame.</td>
    </tr>
</table>

---

## Python Packages Needed for Recoding

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pandas</td>
        <td>Contains the pd.get_dummies() and .value_counts() functions.</td>
    </tr>
</table>

---

## Key R Commands and their Usage 

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>dummy_cols</td>
        <td>Creates dummy coding columns and appends them to the data frame.</td>
    </tr>
</table>

---

## R Libraries Needed for Recoding

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>fastDummies</td>
        <td>Contains the dummy_cols function.</td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 13 - Recoding Hands-On<a class="anchor" id="DS104L3_page_13"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Recoding Hands-On

This Hands-On will **not** be graded, but you are encouraged to complete it. The best way to become a data scientist is to practice.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

You are working for a global chocolate company, and they've collected **[data on worldwide eating habits](https://repo.exeterlms.com/documents/V2/DataScience/Data-Wrang-Visual/Eating_Habits.zip)**. Their eventual goal is to determine the demographics for chocolate-eaters worldwide.  Which countries are most likely to consume chocolate? Which gender, and which age group? Someone else will run these analyses, but it is your job to wrangle and recode the data in preparation. 

---
## Part 1: Recoding in Python

Please perform the following tasks in Python:

* Recode ```Activity``` into a new variable.  Zeros should be not eating chocolate, and 1s should be eating chocolate.
* Recode ```Frequency``` from text to numbers in the same variable.  The value zero should be the lowest frequency.
* Recode ```Sex``` from text to numbers in the same variable.
* Dummy code the ```Age``` group variable.

---
## Part 2: Recoding in R

Please perform the following tasks in R:

* Recode ```Activity``` into a new variable called ```JunkFood```. Anything that you would consider junk food, recode as a 1.  Everything else should be recoded as a zero.
* Recode ```Sex``` from text to numbers in the same variable.

Please comment your code and include both your Python and R files.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>




<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 14 - Recoding Hands-On Solution<a class="anchor" id="DS104L3_page_14"></a>

[Back to Top](#DS104L3_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Solution

---
## Part 1: Recoding in Python

```python
import pandas as pd

EatingHabits = pd.read_csv('C:/Users/meredith.dodd/Documents/New Curriculum/Recoding/Eating_Habits.csv')
EatingHabits.head()

#Recode Activity into a New Variable

EatingHabits.Activity.value_counts()

# .apply() passes one Activity value at a time; only the chocolate category returns 1
def activity(series):
    if series == "Eating fruit" : 
        return 0
    if series == "Drinking soft drinks, cola or other drinks with sugar" : 
        return 0
    if series == "Drinking coffee" : 
        return 0
    if series == "Eating french fries" : 
        return 0
    if series == "Eating hamburgers, hot dogs or sausages" : 
        return 0
    if series == "Eating candy, chocolate bars" : 
        return 1
    if series == "Eating whole wheat or rye bread" : 
        return 0
    if series == "Eating raw vegetables" : 
        return 0
    if series == "Eating potato chips, crisps" : 
        return 0
    if series == "Drinking whole milk" : 
        return 0
    if series == "Drinking low fat milk" : 
        return 0
    if series == "Eating peanuts" : 
        return 0

EatingHabits['ChocolateYN'] = EatingHabits['Activity'].apply(activity)
EatingHabits.head()

# Recode Frequency and Sex from text to numbers in the same variable

EatingHabits.Frequency.value_counts()
EatingHabits.Sex.value_counts()

cleanup = {"Frequency" : {"Never" : 0, "Seldom" : 1, "At least once a week" : 2, "Once a day" : 3, "More than once a day" : 4}, "Sex" : {"Females" : 0, "Males" : 1}}
EatingHabits.replace(cleanup, inplace=True)
EatingHabits.head()

#Dummy code the Age group variable

AgeDummy = pd.get_dummies(EatingHabits['Age group'], drop_first=True)
AgeDummy

EatingHabits2 = pd.concat([EatingHabits, AgeDummy], axis=1)
EatingHabits2.head()
```

---
## Part 2: Recoding in R

To recode ```Activity``` into a new variable called ```JunkFood```:

```{r}
Eating_Habits$JunkFood <- NA
Eating_Habits$JunkFood[Eating_Habits$Activity=='Eating fruit'] <- 0
Eating_Habits$JunkFood[Eating_Habits$Activity=='Eating raw vegetables'] <- 0
Eating_Habits$JunkFood[Eating_Habits$Activity=='Eating candy, chocolate bars'] <- 1
Eating_Habits$JunkFood[Eating_Habits$Activity=='Eating potato chips, crisps'] <- 1
Eating_Habits$JunkFood[Eating_Habits$Activity=='Eating french fries'] <- 1
Eating_Habits$JunkFood[Eating_Habits$Activity=='Eating hamburgers, hot dogs or sausages'] <- 1
Eating_Habits$JunkFood[Eating_Habits$Activity=='Eating peanuts'] <- 0
Eating_Habits$JunkFood[Eating_Habits$Activity=='Eating whole wheat or rye bread'] <- 0
Eating_Habits$JunkFood[Eating_Habits$Activity=='Drinking soft drinks, cola or other drinks with sugar'] <- 1
Eating_Habits$JunkFood[Eating_Habits$Activity=='Drinking coffee'] <- 0
Eating_Habits$JunkFood[Eating_Habits$Activity=='Drinking whole milk'] <- 0
Eating_Habits$JunkFood[Eating_Habits$Activity=='Drinking low fat milk'] <- 0
```

To recode sex from text to numbers in the same variable:

```{r}
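Eating_Habits$Sex[Eating_Habits$Sex=='Males'] <- 0
Eating_Habits$Sex[Eating_Habits$Sex=='Females'] <- 1
# Assuming Sex was read in as text, the values are still stored as strings, so convert them to numbers
Eating_Habits$Sex <- as.numeric(Eating_Habits$Sex)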
```

To dummy code the frequency variable: 

```{r}
library("fastDummies")

Eating_Habits1 <- dummy_cols(Eating_Habits, select_columns = "Frequency")
```