# {index}`Basics with Pandas <single: Basics with Pandas>`
---

## What is a dataframe?

A data frame is a two-dimensional table of entries with labeled rows and columns, much like an Excel sheet. For our usage, each row will usually represent a unique entry while each column will represent an individual attribute of that entry.

## How to make your own data frame

[Pandas](https://www.w3schools.com/python/pandas/pandas_getting_started.asp) is a Python library for building and working with data frames. You will generally be loading CSV files for data analysis, although you can also create a data frame yourself if your data is formatted as a dictionary or a list.

- CSV to data frame

In [1]:
import pandas as pd  
# First upload your CSV to your project folder within your IDE
# You can then convert it into a data frame using the read_csv() function
df = pd.read_csv("name.csv")  
   
# output the dataframe 
print(df)

FileNotFoundError: [Errno 2] No such file or directory: 'name.csv'
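
The FileNotFoundError above appears because no file called name.csv exists in the working directory. As a minimal sketch that runs without any file on disk, the CSV text can be supplied from memory instead. The column names and values below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file such as "name.csv"
csv_text = "name,age\nAda,36\nGrace,45\n"

# read_csv() accepts a file path or a file-like object,
# so StringIO stands in for a real file here
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

Once the real file sits in your project folder, pass its path to read_csv() in place of the StringIO object.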

- Dictionary to data frame

In [8]:
import pandas as pd

#Format your data as a dictionary
data = {'col1': [1,2,3],
        'col2': [4,5,6]
}
#create data frame
df = pd.DataFrame(data = data)
print(df)

   col1  col2
0     1     4
1     2     5
2     3     6


- List to data frame

In [9]:
import pandas as pd
# Create your list
lst = ['fav', 'tutor', 'coding', 'skills']
#create data frame
df = pd.DataFrame(lst)
print(df)


        0
0     fav
1   tutor
2  coding
3  skills


## Data frame manipulation 
---

### Removing unnecessary columns
 
Deleting entries and columns you do not need makes your data frame neater and more concise. Try to remove all the data that is not useful to you from your data set before you continue with any sort of analysis.


In [56]:
import pandas as pd

data = {'entry1': [1,2,3],
        'entry2': [4,5,6],
        'entry3': [7,8,9]
       }    
df = pd.DataFrame(data)


# drop() returns a new data frame; reassigning it to df discards entry2

df = df.drop(columns=['entry2'])
print(df)


   entry1  entry3
0       1       7
1       2       8
2       3       9


If you want to remove a row instead, pass its label to the "index" parameter in place of "columns".
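
A minimal sketch of removing a row, using a small stand-in data frame (the names rows_df and trimmed are only for illustration):

```python
import pandas as pd

rows_df = pd.DataFrame({'entry1': [1, 2, 3],
                        'entry3': [7, 8, 9]})

# Drop the row whose index label is 1; the other rows keep their labels
trimmed = rows_df.drop(index=[1])
print(trimmed)
```

Note that the remaining rows keep their original labels (0 and 2); they are not renumbered.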

You can check that entry2 has been removed correctly using:

In [53]:
'entry2' in df.columns

False

If the column was removed, this returns False, as shown. You can also call describe() to see summary statistics for the remaining numeric columns and confirm which ones are left:

In [54]:
df.describe()

       entry1  entry3
count     3.0     3.0
mean      2.0     8.0
std       1.0     1.0
min       1.0     7.0
25%       1.5     7.5
50%       2.0     8.0
75%       2.5     8.5
max       3.0     9.0


### Removing incomplete data

Sometimes your data may have rows with missing or unknown attributes. Such rows can cause problems when you try to analyze those attributes. You can avoid this by using the dropna() function to remove every row that contains a missing value, which is represented by NumPy's NaN or pandas' NaT.

In [7]:
import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT]})
print("This is the original data frame")
print(df)

print("\nThis is the reduced data frame")
df = df.dropna()
print(df)



This is the original data frame
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

This is the reduced data frame
     name        toy       born
1  Batman  Batmobile 1940-04-25
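
Dropping every incomplete row can discard more data than intended. As a sketch under the assumption that only the "toy" column matters for your analysis, the subset parameter of dropna() limits the check to that column, and isna().sum() counts the missing values per column beforehand. The variable names are illustrative.

```python
import pandas as pd
import numpy as np

heroes = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                       "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                       "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# Count the missing values in each column before removing anything
missing_counts = heroes.isna().sum()
print(missing_counts)

# Only rows missing a value in "toy" are dropped; a missing "born" is tolerated
has_toy = heroes.dropna(subset=["toy"])
print(has_toy)
```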


### Identifying data types

It is good practice to check the data types in your data set before beginning analysis so that you don't run into type-related errors later on. The simplest way to do this is with the dtypes attribute.

In [117]:
import pandas as pd

# Example data set
data = {'float': 7.869,
        'int': 10,
        'datetime': pd.Timestamp('20180310'),
        'string': "Hello"}
# An index is required when every value in the dictionary is a single (scalar) value
df = pd.DataFrame(data, index = [1,2,3,4])
print(df.dtypes)


float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object
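
Once the data types are known, it can be handy to keep only the columns of one kind. The sketch below assumes you want just the numeric columns and uses select_dtypes() for that; the data frame and its contents are made up for illustration.

```python
import pandas as pd

mixed = pd.DataFrame({'float': [7.869, 1.5],
                      'int': [10, 20],
                      'string': ["Hello", "World"]})

# include="number" keeps integer and float columns and leaves out the rest
numeric_only = mixed.select_dtypes(include="number")
print(numeric_only.dtypes)
```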


### Converting data types

In [98]:
import pandas as pd

# Create and print DataFrame
df = pd.DataFrame({
   'A': ['1', '2', '3'],
   'B': ['4', '5', '6'],
   'C': ['7', '8', '9']
})
print(df)

# Print data types of each column in DataFrame
print("\n")
print("Original Data Types")

print(df.dtypes)

# Change column A's values to floats
df['A'] = df['A'].astype(float)

# Change column B and C's values to integers
df = df.astype({'B': int, 'C': int})


# Print altered DataFrame
print("\nConverted Data Types")
print(df)
# Print data types of each column in DataFrame
print("\n")
print(df.dtypes)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


Original Data Types
A    object
B    object
C    object
dtype: object

Converted Data Types
     A  B  C
0  1.0  4  7
1  2.0  5  8
2  3.0  6  9


A    float64
B      int64
C      int64
dtype: object
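
astype() raises an error if any value cannot be converted, for example a column of strings that contains a word instead of a number. One possible way to handle this, shown here as a sketch with made-up data, is pd.to_numeric() with errors="coerce", which replaces unparseable values with NaN.

```python
import pandas as pd

raw = pd.Series(['1', '2', 'three'])

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)
```

The resulting NaN values can then be removed with dropna(), as described in the section on removing incomplete data.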


### Renaming columns

Renaming columns and rows can easily be done using the rename() function.

In [41]:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df.rename(columns={"A": "a", "B": "c"})

df.describe()


         a    c
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0
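
The cell above renames columns only. Rows can be relabelled in the same way by passing a mapping to the index parameter of rename(). The labels used below are arbitrary examples.

```python
import pandas as pd

letters = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Map each old row label to a new one; the columns are left untouched
relabelled = letters.rename(index={0: "first", 1: "second", 2: "third"})
print(relabelled)
```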


### Combining Data Frames
If you want to combine different data frames, use the concat() function. By default it stacks the data frames on top of each other, row by row.

In [45]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)


df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)


df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)


frames = [df1, df2, df3]

result = pd.concat(frames)

print(result)

      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
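
In the example above each data frame was given its own index range, so the combined index runs cleanly from 0 to 11. When the inputs share index labels, which is the default case, a sketch like the following may help: ignore_index=True renumbers the rows, and axis=1 places data frames side by side instead of stacking them. The small data frames are invented for illustration.

```python
import pandas as pd

top = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
bottom = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# Both inputs are indexed 0..1; ignore_index=True renumbers the result 0..3
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)

# axis=1 joins on the row labels and adds the new columns to the right
extra = pd.DataFrame({"C": ["C0", "C1"]})
side_by_side = pd.concat([top, extra], axis=1)
print(side_by_side)
```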


## Other Useful Functions

### Viewing data

|Function|Description|
|:----:|:--------:|
|df.head() | Returns the first 5 rows by default, or the first n rows if a number n is passed as the parameter |
|df.tail() | Returns the last 5 rows by default, or the last n rows if a number n is passed as the parameter |
|display(df) | Uses the display() function from IPython.display to show your data frame in tabular form |


In [24]:
from IPython.display import display
df = pd.DataFrame({
   'A': ['1', '2', '3',"4"],
   'B': ['5', '6', '7',"8"],
   'C': ['7', '8', '9',"10"],

})
print("df.head():")
head = df.head(2) # df.head() returns the first 5 rows by default unless a number is passed as the parameter
print(head)
print("\ndf.tail():")
tail = df.tail(3) # df.tail() works the same way, but returns the last rows (5 by default)
print(tail)
print("\ndisplay(df):")
display(df)


df.head():
   A  B  C
0  1  5  7
1  2  6  8

df.tail():
   A  B   C
1  2  6   8
2  3  7   9
3  4  8  10

display(df):


   A  B   C
0  1  5   7
1  2  6   8
2  3  7   9
3  4  8  10


### Removing duplicates


In [131]:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 6, 7, 7],
                   'B': ['apple', 'banana', 'banana', 'orange', 'apple', 'banana', 'orange', 'apple']
                  })
df.drop_duplicates(subset = "A") 
# The chosen subset is the column that will be checked for repeated values

   A       B
0  1   apple
1  2  banana
3  3  orange
4  4   apple
5  6  banana
6  7  orange
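
By default drop_duplicates() keeps the first row of each repeated value and returns a new data frame, leaving the original unchanged. The sketch below, built on made-up data, shows the keep parameter, which lets you retain the last occurrence instead.

```python
import pandas as pd

fruit = pd.DataFrame({'A': [1, 2, 2, 3],
                      'B': ['apple', 'banana', 'cherry', 'orange']})

# Default behaviour: the first row with A == 2 ("banana") is kept
keep_first = fruit.drop_duplicates(subset="A")

# keep="last" retains the final row with A == 2 ("cherry") instead
keep_last = fruit.drop_duplicates(subset="A", keep="last")

print(keep_first)
print(keep_last)
```

To keep the result, assign it to a variable as done here, since fruit itself still holds all four rows.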


### Viewing data frame size

In [139]:
df = pd.DataFrame({
   'A': ['1', '2', '3',"4"],
   'B': ['5', '6', '7',"8"],
   'C': ['7', '8', '9',"10"],
})
df.shape

(4, 3)