
In [None]:
import pandas as pd

# DataFrame

A DataFrame represents data along two dimensions. You can think of it as a two-dimensional Series, or as several Series objects placed side by side that share the same index.
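As a rough illustration of the "Series side by side" view, the sketch below joins two Series that share an index. The variable names and the data are made up for this example.

```python
import pandas as pd

# Two Series with the same index labels (hypothetical data).
names = pd.Series(['Alice', 'Bob', 'Charlie'], index=[0, 1, 2], name='students')
scores = pd.Series([55, 40, 63], index=[0, 1, 2], name='score1')

# Placing them side by side (axis=1) yields a two-dimensional DataFrame
# whose column names come from the Series names.
combined = pd.concat([names, scores], axis=1)
print(combined)
```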





## Creating a DataFrame from a dictionary of lists

In [None]:
# There are many ways to construct a DataFrame,
# though one of the most common is from a dict of equal-length lists or NumPy arrays.

data = {'students': ['Alice','Bob','Charlie','Dave','Eva', 'Frank'],
      'subjects': ['Bio','Physics','Math','Arts','Chemistry', 'Economics'],
      'score1': [55, 40, 63, 90, 45, 45]}

df = pd.DataFrame(data)
df



,students,subjects,score1
0,Alice,Bio,55
1,Bob,Physics,40
2,Charlie,Math,63
3,Dave,Arts,90
4,Eva,Chemistry,45
5,Frank,Economics,45


The DataFrame will have its index assigned automatically as with Series.

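The column labels of a DataFrame are available through the `.columns` attribute. An individual column can be accessed by its name.

Notice that the column name is used here as an attribute: there are no function-call parentheses and no indexing brackets. This shortcut only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method; bracket access such as `df['subjects']` always works.

Every column of data is a Series object.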
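The small sketch below shows where attribute access and bracket access agree and where only brackets work. The DataFrame `df_demo` and its column names are invented for the illustration.

```python
import pandas as pd

# Hypothetical data: one simple name, one name with a space,
# and one name that collides with a DataFrame method.
df_demo = pd.DataFrame({'subjects': ['Bio', 'Math'],
                        'final score': [55, 63],
                        'count': [1, 2]})

# For a simple name, both styles return the same Series.
same = df_demo.subjects.equals(df_demo['subjects'])

# A name containing a space is only reachable with brackets.
final = df_demo['final score']

# 'count' is also a DataFrame method, so the attribute gives the method,
# while brackets still give the column.
is_method = callable(df_demo.count)
count_col = df_demo['count']
```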


In [None]:
df.subjects

0          Bio
1      Physics
2         Math
3         Arts
4    Chemistry
5    Economics
Name: subjects, dtype: object

In [None]:
df.columns

Index(['students', 'subjects', 'score1'], dtype='object')

In [None]:
df.index

RangeIndex(start=0, stop=6, step=1)

Notice that the row labels are stored in an Index object. The column labels (`columns`) are an Index too!
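A quick way to check this, sketched on a small made-up DataFrame, is to test both attributes against `pd.Index`:

```python
import pandas as pd

df_demo = pd.DataFrame({'students': ['Alice', 'Bob'], 'score1': [55, 40]})

# RangeIndex is a subclass of Index, so both checks are True.
rows_are_index = isinstance(df_demo.index, pd.Index)
cols_are_index = isinstance(df_demo.columns, pd.Index)

# Index methods work on either axis, e.g. looking up the position of a label.
score_position = df_demo.columns.get_loc('score1')
```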

**Exercise 1**

Create an employee table with 5 employees, with columns for name, department, and salary. Give the employees ids starting from 1000, and make the employee id the index of the table.

In [None]:
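# First attempt: the employee id is stored as an ordinary column.
# A separate variable name keeps the students DataFrame `df` intact for later cells.
data1 = {'employee Id': [1000, 1001, 1002, 1003, 1004],
         'employee': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
         'department': ['Marketing', 'IT', 'Finance', 'Accounting', 'Sale'],
         'salary': [24000, 40000, 6000, 4800, 90000]}
df_employees = pd.DataFrame(data1)
df_employees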


,employee Id,employee,department,salary
0,1000,Alice,Marketing,24000
1,1001,Bob,IT,40000
2,1002,Charlie,Finance,6000
3,1003,Dave,Accounting,4800
4,1004,Eva,Sale,90000
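The table above keeps the employee id as an ordinary column, while the exercise asks for it as the index. One possible way to move an existing column into the index is `set_index`; the sketch below uses a shortened, made-up version of the table.

```python
import pandas as pd

staff = pd.DataFrame({'employee Id': [1000, 1001, 1002],
                      'employee': ['Alice', 'Bob', 'Charlie'],
                      'salary': [24000, 40000, 6000]})

# set_index returns a new DataFrame; `staff` keeps its default RangeIndex.
staff_by_id = staff.set_index('employee Id')

# Rows can now be looked up by employee id.
print(staff_by_id.loc[1001, 'employee'])
```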


In [None]:
# Second attempt: pass the employee ids as the index instead of a column.
data1 = {'employee': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eva'],
         'department': ['Marketing', 'IT', 'Finance', 'Accounting', 'Sale'],
         'salary': [24000, 40000, 6000, 4800, 90000]}
employee_id = [1000, 1001, 1002, 1003, 1004]
df_ex_one = pd.DataFrame(data1, index=employee_id)
df_ex_one.index.name = 'employee_id'
df_ex_one

employee_id,employee,department,salary
1000,Alice,Marketing,24000
1001,Bob,IT,40000
1002,Charlie,Finance,6000
1003,Dave,Accounting,4800
1004,Eva,Sale,90000


In [None]:
exercise_one = {'Name': ['Owen', 'Priya', 'Rajasekar', 'James', 'Nicole'],
                'Department': ['Legal', 'Operations', 'Accounts Receivable', 'Legal', 'Design'],
                'Salary': [125000, 75000, 60000, 115000, 95000]}
employee_id = [1000, 1001, 1002, 1003, 1004]
df_ex_one = pd.DataFrame(exercise_one, index=employee_id)
df_ex_one.index.name = 'employee_id'
df_ex_one

employee_id,Name,Department,Salary
1000,Owen,Legal,125000
1001,Priya,Operations,75000
1002,Rajasekar,Accounts Receivable,60000
1003,James,Legal,115000
1004,Nicole,Design,95000


## Operations on series
Every column of data is a `Series` object.  All the columns share the common index.

In [None]:
df.score1

0    55
1    40
2    63
3    90
4    45
5    45
Name: score1, dtype: int64

In [None]:
type(df.score1)

pandas.core.series.Series

In [None]:
type(df.students)

pandas.core.series.Series

In [None]:
df.index

RangeIndex(start=0, stop=6, step=1)

Notice that a DataFrame is a collection of Series objects that share one index. Every Series, in turn, exposes its data as a NumPy array through `.values`.

So everything we learnt about NumPy and Series can be applied to DataFrames too.
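As a small sketch of that idea, the snippet below applies a few familiar NumPy operations to a column. The DataFrame `df_demo` is a made-up stand-in that reuses the scores from the example above.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'score1': [55, 40, 63, 90, 45, 45]})

# A NumPy reduction applied directly to a column.
mean_score = np.mean(df_demo['score1'])

# NumPy ufuncs work element-wise and return a Series with the same index.
roots = np.sqrt(df_demo['score1'])

# Boolean masking works as it does on arrays.
high = df_demo['score1'][df_demo['score1'] > 50]
```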



In [None]:
type(df.score1.values)

numpy.ndarray

In [None]:
type(df.students.values)

numpy.ndarray

In [None]:
df.students

0      Alice
1        Bob
2    Charlie
3       Dave
4        Eva
5      Frank
Name: students, dtype: object

### NumPy broadcasting



In [None]:
df.score1 += 5
df

,students,subjects,score1
0,Alice,Bio,60
1,Bob,Physics,45
2,Charlie,Math,68
3,Dave,Arts,95
4,Eva,Chemistry,50
5,Frank,Economics,50


Notice the `+=`. A plain `+` returns a new Series and leaves the DataFrame unchanged.

In [None]:
df.score1 + 5

0     65
1     50
2     73
3    100
4     55
5     55
Name: score1, dtype: int64

In [None]:
df

,students,subjects,score1
0,Alice,Bio,60
1,Bob,Physics,45
2,Charlie,Math,68
3,Dave,Arts,95
4,Eva,Chemistry,50
5,Frank,Economics,50
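An equivalent way to store the result of a broadcast operation is to assign it back to the column explicitly. The sketch below uses a small made-up DataFrame so that it does not interfere with `df`.

```python
import pandas as pd

df_demo = pd.DataFrame({'score1': [55, 40, 63]})

# The scalar 5 is broadcast to every element; the result is a new Series.
bumped = df_demo['score1'] + 5
unchanged = df_demo['score1'].tolist()   # the column itself is untouched

# Assigning the result back stores it in the DataFrame.
df_demo['score1'] = df_demo['score1'] + 5
```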


### Extracting items using `iloc` and `loc`

We can extract a single row with `iloc` (by position) or `loc` (by label), just as with a Series.

In [None]:
df.iloc[0]

students    Alice
subjects      Bio
score1         60
Name: 0, dtype: object

This looks like a Series object!

In [None]:
type(df.iloc[0])

pandas.core.series.Series

In [None]:
df.iloc[0].index

Index(['students', 'subjects', 'score1'], dtype='object')
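With the default index, position and label coincide, so the difference between `iloc` and `loc` is easy to miss. The sketch below uses a made-up table indexed by employee id, where the two differ.

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Owen', 'Priya', 'James'],
                      'Salary': [125000, 75000, 115000]},
                     index=[1000, 1001, 1002])

# iloc selects by position: the first row is at position 0.
first_by_position = staff.iloc[0]

# loc selects by label: the first row carries the label 1000.
first_by_label = staff.loc[1000]
```

Here `staff.iloc[1000]` would raise an `IndexError`, because there is no row at position 1000.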

Let us take this idea further!  Can we concatenate the rows as a list of Series?

## Dataframe as a list of Series

The same dataframe seen in a different way.  In this case the index of each series object is the name of the column.

In [None]:
# The keys match the column names of `df`, so the result has the same columns.
s0 = pd.Series({'students': 'Alice', 'subjects': 'Bio', 'score1': 60})
s1 = pd.Series({'students': 'Bob', 'subjects': 'Physics', 'score1': 45})
s2 = pd.Series({'students': 'Charlie', 'subjects': 'Math', 'score1': 68})
s3 = pd.Series({'students': 'Dave', 'subjects': 'Arts', 'score1': 95})
s4 = pd.Series({'students': 'Eva', 'subjects': 'Chemistry', 'score1': 50})
s5 = pd.Series({'students': 'Frank', 'subjects': 'Economics', 'score1': 50})

df2 = pd.DataFrame([s0, s1, s2, s3, s4, s5])
df2

,students,subjects,score1
0,Alice,Bio,60
1,Bob,Physics,45
2,Charlie,Math,68
3,Dave,Arts,95
4,Eva,Chemistry,50
5,Frank,Economics,50


The distinction between a column and a row is largely conceptual: you can think of the DataFrame itself as simply a labeled array with two axes.

So if we extract a row object, it is also a series.

In [None]:
type(df.iloc[0])

pandas.core.series.Series

The row has elements of different types, so when we extract a row the resulting Series has dtype `object`.

In [None]:
df.iloc[0].values

array(['Alice', 'Bio', 60], dtype=object)
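To make the contrast concrete, the sketch below compares the dtype of a numeric column with the dtype of a mixed row, using a small made-up DataFrame.

```python
import pandas as pd

df_demo = pd.DataFrame({'students': ['Alice', 'Bob'], 'score1': [60, 45]})

# A column holds values of one type, so it keeps an integer dtype.
column_dtype = df_demo['score1'].dtype

# A row mixes strings and integers, so it falls back to object.
row_dtype = df_demo.iloc[0].dtype
```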

## Some basic dataframe functions

In [None]:
df.head()  # head() returns the first five rows by default

,students,subjects,score1
0,Alice,Bio,60
1,Bob,Physics,45
2,Charlie,Math,68
3,Dave,Arts,95
4,Eva,Chemistry,50


In [None]:
df.tail()

,students,subjects,score1
1,Bob,Physics,45
2,Charlie,Math,68
3,Dave,Arts,95
4,Eva,Chemistry,50
5,Frank,Economics,50
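Both `head` and `tail` accept an optional number of rows. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame({'score1': [60, 45, 68, 95, 50, 50]})

first_two = df_demo.head(2)    # rows at positions 0 and 1
last_three = df_demo.tail(3)   # the final three rows
```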


## Indexing

Slicing a DataFrame with a range, as in `df[1:3]`, selects rows by position, just as with a Series.

In [None]:
df[1:3]

,students,subjects,score1
1,Bob,Physics,45
2,Charlie,Math,68


In [None]:
type(df[1:3])

pandas.core.frame.DataFrame

Slicing rows returns a DataFrame, so we can then select along the second dimension too. Selecting a single column from the result gives us a Series object, as expected.

In [None]:
df[1:3]['subjects']

1    Physics
2       Math
Name: subjects, dtype: object

In [None]:
type(df[1:3]['subjects'])

pandas.core.series.Series

As we noticed earlier, there is no preference for one or the other axis. Rows are an index, columns are also an index. So we can subset along the columns too.

In [None]:
df['subjects']

0          Bio
1      Physics
2         Math
3         Arts
4    Chemistry
5    Economics
Name: subjects, dtype: object

In [None]:
df[['subjects', 'students']]

,subjects,students
0,Bio,Alice
1,Physics,Bob
2,Math,Charlie
3,Arts,Dave
4,Chemistry,Eva
5,Economics,Frank


In [None]:
df[['subjects', 'students']][1:3]

,subjects,students
1,Physics,Bob
2,Math,Charlie
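The chained selection above can also be written as a single `loc` or `iloc` call that names rows and columns together. The sketch below, on a made-up copy of the data, suggests that the three forms give the same result.

```python
import pandas as pd

df_demo = pd.DataFrame({'students': ['Alice', 'Bob', 'Charlie', 'Dave'],
                        'subjects': ['Bio', 'Physics', 'Math', 'Arts'],
                        'score1': [60, 45, 68, 95]})

# Chained: pick the columns first, then slice rows by position.
chained = df_demo[['subjects', 'students']][1:3]

# loc slices by label, and label slices include the end point,
# so 1:2 covers the rows labeled 1 and 2.
by_label = df_demo.loc[1:2, ['subjects', 'students']]

# iloc slices by position with the usual exclusive end.
by_position = df_demo.iloc[1:3, [1, 0]]
```

A single `loc` or `iloc` call is often preferred when the goal is to assign values, since assigning through a chained selection may act on a temporary copy.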
