# Chapter 2 - Data Preparation Basics
## Segment 4 - Concatenating and transforming data

Concatenating and transforming
- [Instructor] Concatenation and data transformation are useful for getting your data into the structure and order you need for analysis. Concatenating is simply combining data from separate sources. Transformation, on the other hand, is converting and reformatting data into the format that's necessary for your purposes.

I've already imported NumPy and pandas. Now, let's create a DataFrame object. This time we're going to create a 6x6 DataFrame with values that range from 0 to 35. We'll call it `DF_obj` and set it equal to `pd.DataFrame`. Inside the constructor we call `np.arange` and pass in the value 36. This tells NumPy to generate a sequence of values from 0 to 35. Then we call the `.reshape` method off of that and tell it we want six rows by six columns. Then we'll print this out. Okay, so you see here we have a 6x6 DataFrame with values that range from 0 to 35, in sequential order.

Right, so let's create another DataFrame object that we can use to practice our concatenation and transformation. We'll call the second object `DF_obj_2`. Then we will, again, call our DataFrame constructor. This time we're going to create a sequence of numbers from 0 to 14, so we'll say `np.arange` and pass in the value 15. We want to shape this as a 5x3, so we'll say `reshape` with five rows and three columns. Now let's print this out. Okay, cool. So you can see we have a sequence of numbers from 0 to 14, arranged in five rows and three columns. Perfect.

Now let's practice concatenating. The `concat` function joins data from separate sources into one combined table. If you want to join objects based on their row index values, you call `pd.concat` on the objects you want joined and pass in the `axis=1` argument. The `axis=1` argument tells pandas to concatenate the DataFrames by adding columns.
In other words, it joins on the row index values. So, let's try that out. We'll say `pd.concat` and pass in a list containing `DF_obj` and `DF_obj_2`, the two DataFrames we just created, and then we'll say `axis=1`. This tells pandas to concatenate by adding columns, so it aligns the data on the row index values. So, we'll run this. As you can see, we now have our original DataFrame with six columns, plus another three columns. The concatenation was done based on the row index values.

If you want to concatenate based on the column index values instead, pandas will add the data as rows instead of columns. To do that, you just leave out `axis=1`. So let's copy and paste this command, remove the axis argument, close the brackets, and run it. Now you can see that these two DataFrames have been concatenated based on the column indexes.

Now let's look at how to transform data. The first thing I want to do is show you how to drop data. You can easily drop rows from a DataFrame by calling the `drop` method and passing in the index values for the rows you want dropped. So, let's say `DF_obj.drop` and pass in 0 and 2. This drops the rows with index labels 0 and 2. When we print this out, you can see rows 0 and 2 are now gone. Okay, now if you want to drop columns instead of rows, you call the same method but pass in `axis=1`. So, let's try that here. I'm just going to add `axis=1` as a parameter to the `drop` method and run that. You see now that we have dropped the columns labeled 0 and 2. So that is how that works. Now let's look at adding data.
So for this exercise, I want to create a Series object that we can add as a variable to our DataFrame. To do that, we're going to say `series_obj` and call the `Series` constructor. We want a series of values from 0 to 5, so that's `np.arange` with 6 to get our six values. Then, let's name this whole thing `added_variable`: we say `series_obj.name` and set it equal to `"added_variable"`. We're just creating a variable here, and we'll print it out to see what it looks like. Okay, cool. So we have a new variable.

Now I want to show you how to join this to our existing data. We'll do that using the `join` method. We'll call the output `variable_added`, because the variable is going to be added to our DataFrame. What we want to do is join our DataFrame object with the Series object we just created, so we name both of these objects in the call and then print it out. All right, cool. As you can see, the variable has now been added to our original DataFrame.

Another thing I want to show you: in addition to the `join` method, which adds columns, you can use the `append` method to add rows. I'm going to show you two ways to use `append`. One leaves the original index values in place, and the other re-creates the index. So let's create a new data table. We'll call it `added_datatable`, set it equal to `variable_added`, and call the `append` method off of that. What I'm going to do here is append the `variable_added` DataFrame to itself, so we pass in `variable_added` as the argument. Since we want to leave the index values as they were and not re-index the output, we say `ignore_index=False`.
So, I'll type that in, and then let's print this out: `added_datatable`, and run it. Okay, cool. So now we see that our `variable_added` DataFrame has been appended to itself, but the index, instead of starting at 0 and continuing on past 10, has retained the index values from the original `variable_added` DataFrame. You will probably want to regenerate the index so that it's accurate for the appended table. So let me show you how to do that. You take the same code, but instead of `ignore_index=False` you say `ignore_index=True` and run it. Now you can see that the output table has been appended, and the index has also been reset so it increases sequentially from 0 to 11, which is desirable.

The last thing I want to show you in this section is how to sort data. To do that, we use the `sort_values` method and call it off of our original DataFrame object. Let's call the output `DF_sorted`, set it equal to `DF_obj`, and call the `sort_values` method. Say we want to sort the DataFrame by the values in column 5. We say `by=5`, which is the column label, and then `ascending=False`, because we want this returned in decreasing order. So we'll print this out and see what it looks like. As you can see, all of the values in the column at index position 5 are in decreasing order. So, we have successfully sorted our DataFrame object by the values in column 5.

In [0]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

In [0]:
DF_obj = pd.DataFrame(np.arange(36).reshape(6,6))
DF_obj

Unnamed: 0,0,1,2,3,4,5
0,0,1,2,3,4,5
1,6,7,8,9,10,11
2,12,13,14,15,16,17
3,18,19,20,21,22,23
4,24,25,26,27,28,29
5,30,31,32,33,34,35
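
The construction above relies on two NumPy behaviors worth seeing in isolation: `np.arange(n)` stops before `n`, and `reshape` needs dimensions that multiply to the number of elements. The following is a minimal sketch of both; the names `values`, `grid`, and `frame` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

values = np.arange(36)        # 0..35; the stop value 36 is excluded
grid = values.reshape(6, -1)  # -1 asks NumPy to infer the second dimension
frame = pd.DataFrame(grid)    # default row and column labels are 0..5

print(values[0], values[-1])  # 0 35
print(frame.shape)            # (6, 6)

# 36 elements cannot fill a 5x7 grid, so NumPy raises ValueError.
try:
    np.arange(36).reshape(5, 7)
except ValueError as err:
    print("reshape failed:", err)
```

Because the DataFrame gets default integer labels, the row and column labels happen to match their positions, which matters later when rows and columns are dropped by label.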


In [0]:
DF_obj_2 = pd.DataFrame(np.arange(15).reshape(5,3))
DF_obj_2

Unnamed: 0,0,1,2
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11
4,12,13,14


### Concatenating data

In [0]:
pd.concat([DF_obj, DF_obj_2], axis=1)

Unnamed: 0,0,1,2,3,4,5,0,1,2
0,0,1,2,3,4,5,0.0,1.0,2.0
1,6,7,8,9,10,11,3.0,4.0,5.0
2,12,13,14,15,16,17,6.0,7.0,8.0
3,18,19,20,21,22,23,9.0,10.0,11.0
4,24,25,26,27,28,29,12.0,13.0,14.0
5,30,31,32,33,34,35,,,
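
The empty cells in the last row of this output are `NaN` values: `DF_obj_2` has no row labeled 5, so pandas has nothing to align there. As a rough sketch, assuming you would rather keep only the row labels that both frames share, `pd.concat` accepts a `join` argument. The names `left` and `right` are placeholders that mirror the two frames above.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(36).reshape(6, 6))   # row labels 0..5
right = pd.DataFrame(np.arange(15).reshape(5, 3))  # row labels 0..4

# Default join="outer": keep every row label, fill the gaps with NaN.
outer = pd.concat([left, right], axis=1)

# join="inner": keep only the row labels present in both frames.
inner = pd.concat([left, right], axis=1, join="inner")

print(outer.shape)  # (6, 9)
print(inner.shape)  # (5, 9)
```

Which option fits depends on whether the unmatched row is meaningful for your analysis; the outer join preserves it, the inner join silently discards it.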


In [0]:
pd.concat([DF_obj, DF_obj_2])

Unnamed: 0,0,1,2,3,4,5
0,0,1,2,3.0,4.0,5.0
1,6,7,8,9.0,10.0,11.0
2,12,13,14,15.0,16.0,17.0
3,18,19,20,21.0,22.0,23.0
4,24,25,26,27.0,28.0,29.0
5,30,31,32,33.0,34.0,35.0
0,0,1,2,,,
1,3,4,5,,,
2,6,7,8,,,
3,9,10,11,,,
4,12,13,14,,,
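
Stacking the frames this way leaves duplicate row labels, because each frame brings its own index. Here is a small sketch of two ways to handle that with arguments `pd.concat` provides: `ignore_index=True` renumbers the rows, while `keys` tags each row with its source. The names `top`, `bottom`, and the key strings are illustrative.

```python
import numpy as np
import pandas as pd

top = pd.DataFrame(np.arange(36).reshape(6, 6))
bottom = pd.DataFrame(np.arange(15).reshape(5, 3))

# Plain row-wise concat: labels 0..4 appear twice.
stacked = pd.concat([top, bottom])
print(stacked.index.is_unique)  # False

# Renumber the rows 0..10.
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered.index.is_unique)  # True

# Or keep the original labels and add an outer level naming the source.
labeled = pd.concat([top, bottom], keys=["top", "bottom"])
print(labeled.loc["bottom"].shape)  # (5, 6)
```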


### Transforming data
#### Dropping data

In [0]:
DF_obj.drop([0, 2])

Unnamed: 0,0,1,2,3,4,5
1,6,7,8,9,10,11
3,18,19,20,21,22,23
4,24,25,26,27,28,29
5,30,31,32,33,34,35


In [0]:
DF_obj.drop([0, 2], axis=1)

Unnamed: 0,1,3,4,5
0,1,3,4,5
1,7,9,10,11
2,13,15,16,17
3,19,21,22,23
4,25,27,28,29
5,31,33,34,35
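
Two details of `drop` are easy to miss: it returns a new DataFrame rather than modifying the original, and it matches labels, not positions. The sketch below illustrates both, using the `index=` and `columns=` keywords as an alternative to `axis`. The `lettered` frame with string row labels is a hypothetical example added here to make the label/position difference visible.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6))

without_rows = frame.drop(index=[0, 2])    # same as frame.drop([0, 2])
without_cols = frame.drop(columns=[0, 2])  # same as frame.drop([0, 2], axis=1)

print(frame.shape)         # (6, 6), the original is untouched
print(without_rows.shape)  # (4, 6)
print(without_cols.shape)  # (6, 4)

# With non-integer labels, drop by position by looking the labels up first.
lettered = pd.DataFrame(np.arange(36).reshape(6, 6), index=list("abcdef"))
by_position = lettered.drop(lettered.index[[0, 2]])
print(list(by_position.index))  # ['b', 'd', 'e', 'f']
```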


### Adding data

In [0]:
series_obj = Series(np.arange(6))
series_obj.name = "added_variable"
series_obj

0    0
1    1
2    2
3    3
4    4
5    5
Name: added_variable, dtype: int32

In [0]:
variable_added = DF_obj.join(series_obj)
variable_added

Unnamed: 0,0,1,2,3,4,5,added_variable
0,0,1,2,3,4,5,0
1,6,7,8,9,10,11,1
2,12,13,14,15,16,17,2
3,18,19,20,21,22,23,3
4,24,25,26,27,28,29,4
5,30,31,32,33,34,35,5
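
The join works cleanly here because the Series and the DataFrame share the same index, and because the Series has a name that can become the column label. As an illustrative sketch (the `partial` Series is a made-up example, not from the notebook), this is what happens when the Series covers only some of the rows:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6))
extra = pd.Series(np.arange(6), name="added_variable")

# join aligns on the index; the Series name becomes the new column label.
joined = frame.join(extra)
print(list(joined.columns))  # [0, 1, 2, 3, 4, 5, 'added_variable']

# A Series that covers only rows 0..2 leaves NaN in the remaining rows.
partial = pd.Series([10, 20, 30], index=[0, 1, 2], name="partial")
joined_partial = frame.join(partial)
print(joined_partial["partial"].isna().sum())  # 3
```

`join` keeps every row of the calling DataFrame by default, so missing matches show up as `NaN` rather than as dropped rows.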


In [0]:
added_datatable = variable_added.append(variable_added, ignore_index=False)
added_datatable

Unnamed: 0,0,1,2,3,4,5,added_variable
0,0,1,2,3,4,5,0
1,6,7,8,9,10,11,1
2,12,13,14,15,16,17,2
3,18,19,20,21,22,23,3
4,24,25,26,27,28,29,4
5,30,31,32,33,34,35,5
0,0,1,2,3,4,5,0
1,6,7,8,9,10,11,1
2,12,13,14,15,16,17,2
3,18,19,20,21,22,23,3
4,24,25,26,27,28,29,4
5,30,31,32,33,34,35,5


In [0]:
added_datatable = variable_added.append(variable_added, ignore_index=True)
added_datatable

Unnamed: 0,0,1,2,3,4,5,added_variable
0,0,1,2,3,4,5,0
1,6,7,8,9,10,11,1
2,12,13,14,15,16,17,2
3,18,19,20,21,22,23,3
4,24,25,26,27,28,29,4
5,30,31,32,33,34,35,5
6,0,1,2,3,4,5,0
7,6,7,8,9,10,11,1
8,12,13,14,15,16,17,2
9,18,19,20,21,22,23,3
10,24,25,26,27,28,29,4
11,30,31,32,33,34,35,5
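
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the two cells above only run on older releases. The following sketch shows one way to get the same two results with `pd.concat`, which takes the same `ignore_index` argument. The frame is rebuilt here under the placeholder name `frame` so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Rebuild the 6x7 frame used above (six number columns plus added_variable).
frame = pd.DataFrame(np.arange(36).reshape(6, 6))
frame["added_variable"] = np.arange(6)

# Keep the original row labels: 0..5 followed by 0..5 again.
kept = pd.concat([frame, frame], ignore_index=False)
print(list(kept.index))  # [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]

# Renumber the rows 0..11.
renumbered = pd.concat([frame, frame], ignore_index=True)
print(renumbered.index.max())  # 11
```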


### Sorting data

In [0]:
DF_sorted = DF_obj.sort_values(by=5, ascending=False)
DF_sorted

Unnamed: 0,0,1,2,3,4,5
5,30,31,32,33,34,35
4,24,25,26,27,28,29
3,18,19,20,21,22,23
2,12,13,14,15,16,17
1,6,7,8,9,10,11
0,0,1,2,3,4,5
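
Since every column of `DF_obj` increases from top to bottom, sorting it in descending order simply reverses the rows. A small sketch with unordered data may make the behavior easier to see; the `group` and `score` columns are invented for illustration. It also shows that `by` and `ascending` accept lists when sorting on several columns, and that the original row labels travel with the rows unless you reset them.

```python
import pandas as pd

frame = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [1, 4, 3, 2]})

# Single column, descending: the row labels follow their rows.
by_score = frame.sort_values(by="score", ascending=False)
print(list(by_score.index))  # [1, 2, 3, 0]

# Several columns: group ascending, then score descending within each group.
by_both = frame.sort_values(by=["group", "score"], ascending=[True, False])
print(list(by_both.index))  # [1, 3, 2, 0]

# Renumber the rows once the order is final.
tidy = by_both.reset_index(drop=True)
print(list(tidy["score"]))  # [4, 2, 3, 1]
```

Like `drop`, `sort_values` returns a new DataFrame and leaves the original in its existing order.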
