### Dataframe from CSV

In data science, **[CSV](https://en.wikipedia.org/wiki/Comma-separated_values)** files are commonly used to store large datasets. To analyze such a dataset efficiently, we load it into a pandas DataFrame.

To create a DataFrame from a CSV file, we use the **`read_csv('file_name')`** function, which takes the file name (or path) as input and returns a DataFrame.
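As a minimal sketch of the call described above, the snippet below reads CSV text from an in-memory buffer instead of a file on disk, so it runs without any data file. The variable names `csv_text` and `prices` are illustrative assumptions, and the three rows are a small excerpt shaped like the stock-price data used in Example 1.

```python
import io

import pandas as pd

# Illustrative CSV text standing in for a file on disk (assumption: not the real file)
csv_text = """Date,Closing price
1/1/2020,100
2/1/2020,120
3/1/2020,130
"""

# read_csv accepts a file path or any file-like object
prices = pd.read_csv(io.StringIO(csv_text))

print(type(prices))          # the result is a DataFrame
print(prices.shape)          # (number of rows, number of columns)
print(list(prices.columns))  # column labels come from the header line
```

Passing a path string such as `"stockprice_data.csv"` works the same way, provided the file exists relative to the current working directory.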

**Example 1:**

Let’s see how to read the **[stockprice_data.csv](https://github.com/milaan9/10_Python_Pandas_Module/blob/main/stockprice_data.csv)** file into a DataFrame and then select one of its columns as a pandas Series.

<div>
<img src="img/csvfile1.png" width="300" />
</div>

In [18]:
import pandas as pd
data = pd.read_csv("stockprice_data.csv")
data

Unnamed: 0,Date,Closing price,Return
0,1/1/2020,100,0.01
1,2/1/2020,120,0.2
2,3/1/2020,130,0.083333
3,4/1/2020,98,-0.246154
4,5/1/2020,50,-0.489796
5,6/1/2020,102,1.04
6,7/1/2020,104,0.019608
7,8/1/2020,150,0.442308
8,9/1/2020,160,0.066667
9,10/1/2020,109,-0.31875


In [9]:
type(data)

pandas.core.frame.DataFrame

In [10]:
data1 = data.iloc[:, 2]     # Select the third column ('Return') by position; the result is a pandas Series
data1

0     0.010000
1     0.200000
2     0.083333
3    -0.246154
4    -0.489796
5     1.040000
6     0.019608
7     0.442308
8     0.066667
9    -0.318750
10   -0.128440
Name: Return, dtype: float64

In [19]:
type(data1)

pandas.core.series.Series
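Selecting a column by position, as above, depends on the column order. The sketch below, built on a small hand-made frame (the name `prices` and its three rows are illustrative assumptions), shows that selecting by label gives the same Series, while double brackets keep the result as a one-column DataFrame.

```python
import pandas as pd

# Small illustrative frame shaped like the stock-price data
prices = pd.DataFrame({
    "Date": ["1/1/2020", "2/1/2020", "3/1/2020"],
    "Closing price": [100, 120, 130],
    "Return": [0.01, 0.2, 0.083333],
})

by_position = prices.iloc[:, 2]   # third column by position -> Series
by_label = prices["Return"]       # same column by label -> Series
as_frame = prices[["Return"]]     # list of labels -> one-column DataFrame

print(type(by_position).__name__)
print(by_label.equals(by_position))
print(type(as_frame).__name__)
```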

**Example 2:**

Let’s see how to read the **[automobile_data.csv](https://github.com/milaan9/10_Python_Pandas_Module/blob/main/automobile_data.csv)** file into the DataFrame.

<div>
<img src="img/csvfile.png" width="500"/>
</div>

In [20]:
cars = pd.read_csv("automobile_data.csv")  # just give the name of file only if the file is in the same folder.
print(cars)

    index      company   body-style  wheel-base  length engine-type  \
0       0  alfa-romero  convertible        88.6   168.8        dohc   
1       1  alfa-romero  convertible        88.6   168.8        dohc   
..    ...          ...          ...         ...     ...         ...   
59     87        volvo        sedan       104.3   188.8         ohc   
60     88        volvo        wagon       104.3   188.8         ohc   

   num-of-cylinders  horsepower  average-mileage    price  
0              four         111               21  13495.0  
1              four         111               21  16500.0  
..              ...         ...              ...      ...  
59             four         114               23  12940.0  
60             four         114               23  13415.0  

[61 rows x 10 columns]


## DataFrame Options

When a DataFrame is large, pandas cannot show all of its data when we print it, so it prints a truncated view. To control how a DataFrame is displayed on the console by the **`print()`** function, pandas provides many options and functions for customizing its presentation.

### To customize the display of DataFrame while printing

When we display a DataFrame using the **`print()`** function and it has more rows than the `display.max_rows` option allows (60 by default), pandas shows a truncated view of `display.min_rows` rows (10 by default: the top 5 and the bottom 5). Sometimes we may need to show more or fewer rows than this default view.

We can change these settings through the **`pd.options`** attribute or the **`pd.set_option()`** function. Both can be used interchangeably.

The example below sets the maximum to 20 rows and the minimum to 10 rows, so a DataFrame with more than 20 rows is printed as a truncated view of 10 rows.

In [22]:
import pandas as pd

# DataFrames with more rows than this are printed in truncated form
pd.options.display.max_rows = 20

# Number of rows shown in the truncated view
pd.set_option("display.min_rows", 10)

# Print DataFrame
print(cars)

    index      company   body-style  wheel-base  length engine-type  \
0       0  alfa-romero  convertible        88.6   168.8        dohc   
1       1  alfa-romero  convertible        88.6   168.8        dohc   
2       2  alfa-romero    hatchback        94.5   171.2        ohcv   
3       3         audi        sedan        99.8   176.6         ohc   
4       4         audi        sedan        99.4   176.6         ohc   
..    ...          ...          ...         ...     ...         ...   
56     81   volkswagen        sedan        97.3   171.7         ohc   
57     82   volkswagen        sedan        97.3   171.7         ohc   
58     86   volkswagen        sedan        97.3   171.7         ohc   
59     87        volvo        sedan       104.3   188.8         ohc   
60     88        volvo        wagon       104.3   188.8         ohc   

   num-of-cylinders  horsepower  average-mileage    price  
0              four         111               21  13495.0  
1              four        
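Setting options as above changes them for the rest of the session. As a hedged sketch of an alternative, `pd.option_context` applies options only inside a `with` block and restores the previous values afterwards. The 100-row frame `df` is a hypothetical stand-in for a large dataset.

```python
import pandas as pd

# Hypothetical 100-row frame, large enough to trigger truncation
df = pd.DataFrame({"n": range(100)})

before_max = pd.get_option("display.max_rows")

# Options set here apply only inside the with block
with pd.option_context("display.max_rows", 20, "display.min_rows", 10):
    inside_max = pd.get_option("display.max_rows")
    truncated_text = repr(df)   # the same text that print(df) would show
    print(truncated_text)

# Outside the block the earlier setting is back in effect
after_max = pd.get_option("display.max_rows")
print(before_max, inside_max, after_max)
```

To return an option to its library default explicitly, `pd.reset_option("display.max_rows")` can be used.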

## DataFrame metadata

Sometimes we need the metadata of a DataFrame rather than its contents. Metadata such as column names, data types, and memory usage helps us understand the DataFrame we are about to process.

In this section, we cover the functions that provide this information.

Let’s take an example of a student DataFrame that contains the **'Name'**, **'Age'** and **'Marks'** of students, as shown below:

```python
    Name  Age  Marks
0    Joe   20  85.10
1    Nat   21  77.80
2  Harry   19  91.54
```
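The student DataFrame shown above can be built from a dictionary, in the same way the later examples in this document do, so that `student_df` is defined before it is used:

```python
import pandas as pd

# Create the student DataFrame from a dict of column name -> values
student_dict = {
    "Name": ["Joe", "Nat", "Harry"],
    "Age": [20, 21, 19],
    "Marks": [85.10, 77.80, 91.54],
}
student_df = pd.DataFrame(student_dict)

print(student_df)
```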

### Metadata info of DataFrame

**`DataFrame.info()`** is a DataFrame method that prints a summary of the DataFrame's metadata, which includes:

* Number of rows and the range of the index
* Total number of columns
* List of columns
* Number of non-null values in each column
* Data type of each column
* Number of columns of each data type
* Memory usage of the DataFrame

In [23]:
# Example: In the below example, we got metadata information of student DataFrame.

# get dataframe info
student_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    3 non-null      object 
 1   Age     3 non-null      int64  
 2   Marks   3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 204.0+ bytes
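The `+` in the memory usage line indicates a lower bound, because the memory held by Python string objects is not counted by default. The sketch below is one way to request a deeper measurement with `memory_usage="deep"` and to capture the report as text through the `buf` parameter instead of printing it. The names `buffer` and `report` are illustrative.

```python
import io

import pandas as pd

student_df = pd.DataFrame({
    "Name": ["Joe", "Nat", "Harry"],
    "Age": [20, 21, 19],
    "Marks": [85.10, 77.80, 91.54],
})

# Write the report into a text buffer instead of standard output
buffer = io.StringIO()
student_df.info(buf=buffer, memory_usage="deep")
report = buffer.getvalue()

print(report)
```

The exact byte count depends on the pandas version and platform, so it is not reproduced here.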


### Get the statistics of DataFrame

**`DataFrame.describe()`** is a method that gives descriptive statistics of the data in a DataFrame. By default, it applies only to the columns that contain numeric values.

In our example of the student DataFrame, it gives descriptive statistics of the **'Age'** and **'Marks'** columns only, which include:

1. **count**: number of non-null values in the column
2. **mean**: average of the values
3. **std**: standard deviation
4. **min**: minimum value
5. **25%**: 25th percentile
6. **50%**: 50th percentile (the median)
7. **75%**: 75th percentile
8. **max**: maximum value

>**Note:** Output of **`DataFrame.describe()`** function varies depending on the input DataFrame.

In [24]:
# Example

# get dataframe description
student_df.describe()

Unnamed: 0,Age,Marks
count,3.0,3.0
mean,20.0,84.813333
std,1.0,6.874484
min,19.0,77.8
25%,19.5,81.45
50%,20.0,85.1
75%,20.5,88.32
max,21.0,91.54
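As the note above says, the output depends on the input. A small sketch of two variations: passing `include="all"` summarizes every column, and describing only the text column reports `count`, `unique`, `top` and `freq` instead of numeric statistics. The variable names `summary_all` and `name_summary` are illustrative.

```python
import pandas as pd

student_df = pd.DataFrame({
    "Name": ["Joe", "Nat", "Harry"],
    "Age": [20, 21, 19],
    "Marks": [85.10, 77.80, 91.54],
})

# Summarize numeric and non-numeric columns together;
# statistics that do not apply to a column are shown as NaN
summary_all = student_df.describe(include="all")
print(summary_all)

# A frame with only text columns gets count / unique / top / freq
name_summary = student_df[["Name"]].describe()
print(name_summary)
```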


## DataFrame Attributes

A DataFrame provides many built-in attributes. Unlike methods, attributes are accessed without parentheses; they do not modify the underlying data and are used to get more details about the DataFrame.

The following are the most commonly used attributes of a DataFrame:

| Attribute | Description |
|:---- |:---- |
| **`DataFrame.index`**   | **It gives the row index (row labels) of the DataFrame** |
| **`DataFrame.columns`** | **It gives the column labels** |
| **`DataFrame.dtypes`**  | **It gives the column names and their data types** |
| **`DataFrame.values`**  | **It gives all the rows in the DataFrame as a NumPy array** |
| **`DataFrame.empty`**   | **It is used to check if the DataFrame is empty** |
| **`DataFrame.size`**    | **It gives the total number of values in the DataFrame** |
| **`DataFrame.shape`**   | **It gives the number of rows and columns in the DataFrame** |

In [25]:
# Example:

import pandas as pd

# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Age': [20, 21, 19], 'Marks': [85.10, 77.80, 91.54]}

student_df = pd.DataFrame(student_dict)

print("DataFrame : ", student_df)

print("DataFrame Index : ", student_df.index)
print("DataFrame Columns : ", student_df.columns)

print("DataFrame Column types : ", student_df.dtypes)

print("DataFrame is empty? : ", student_df.empty)

print("DataFrame Shape : ", student_df.shape)
print("DataFrame Size : ", student_df.size)

print("DataFrame Values : ", student_df.values)

DataFrame :      Name  Age  Marks
0    Joe   20  85.10
1    Nat   21  77.80
2  Harry   19  91.54
DataFrame Index :  RangeIndex(start=0, stop=3, step=1)
DataFrame Columns :  Index(['Name', 'Age', 'Marks'], dtype='object')
DataFrame Column types :  Name      object
Age        int64
Marks    float64
dtype: object
DataFrame is empty? :  False
DataFrame Shape :  (3, 3)
DataFrame Size :  9
DataFrame Values :  [['Joe' 20 85.1]
 ['Nat' 21 77.8]
 ['Harry' 19 91.54]]
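A short sketch of how some of these attributes relate to each other: `size` equals the number of rows times the number of columns from `shape`, and `empty` is `True` for a DataFrame with no data. It also uses `to_numpy()`, the method the pandas documentation recommends over `values`. The variable names are illustrative.

```python
import pandas as pd

student_df = pd.DataFrame({
    "Name": ["Joe", "Nat", "Harry"],
    "Age": [20, 21, 19],
    "Marks": [85.10, 77.80, 91.54],
})

# shape is a (rows, columns) tuple; size is their product
rows, cols = student_df.shape
print(rows, cols, student_df.size)

# to_numpy() returns the data as a 2-D NumPy array, like .values
arr = student_df.to_numpy()
print(arr.shape)

# A DataFrame created without data is empty
empty_df = pd.DataFrame()
print(empty_df.empty, student_df.empty)
```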


## DataFrame selection

When working with large DataFrames, a data analyst often needs to select particular rows or columns for analysis. In such cases, functions that select a set of rows or columns, such as the top rows, the bottom rows, or the data within an index range, play a significant role.

The following functions and indexers help in selecting a subset of a DataFrame:

| Function / Indexer | Description |
|:---- |:---- |
| **`DataFrame.head(n)`**  | **It is used to select top ‘n’ rows in DataFrame.** |
| **`DataFrame.tail(n)`**  | **It is used to select bottom ‘n’ rows in DataFrame.** |
| **`DataFrame.at`**       | **It is used to get and set the particular value of DataFrame using row and column labels.** |
| **`DataFrame.iat`**      | **It is used to get and set the particular value of DataFrame using row and column index positions.** |
| **`DataFrame.get(key)`** | **It is used to get the column stored under a key, where the key is the column name. It returns a default value if the key is missing.** |
| **`DataFrame.loc[]`**    | **It is used to select a group of data based on the row and column labels. It is used for slicing and filtering of the DataFrame. Label slices include the end label.** |
| **`DataFrame.iloc[]`**   | **It is used to select a group of data based on the row and column index positions. Use it for slicing the DataFrame by position. Position slices exclude the end position.** |

In [17]:
# Example

import pandas as pd

# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Age': [20, 21, 19], 'Marks': [85.10, 77.80, 91.54]}

student_df = pd.DataFrame(student_dict)

# display dataframe
print("DataFrame : \n", student_df)

# select top 2 rows
print("\nstudent_df.head(2) : \n")
print(student_df.head(2))

# select bottom 2 rows
print("\nstudent_df.tail(2) : \n")
print(student_df.tail(2))

# select value at row index 0 and column 'Name'
print("\nstudent_df.at[0, 'Name'] : \n")
print(student_df.at[0, 'Name'])

# select value at first row and first column
print("\nstudent_df.iat[0, 0] : \n")
print(student_df.iat[0, 0])


# select values of 'Name' column
print("\nstudent_df.get('Name') : \n")
print(student_df.get('Name'))

# select values from row label 0 to 2 (inclusive) and 'Name' column
print("\nstudent_df.loc[0:2, ['Name']] : \n")
print(student_df.loc[0:2, ['Name']])

# select values from row index 0 to 2(exclusive) and column position 0 to 2(exclusive)
print("\nstudent_df.iloc[0:2, 0:2] : \n")
print(student_df.iloc[0:2, 0:2])

DataFrame : 
     Name  Age  Marks
0    Joe   20  85.10
1    Nat   21  77.80
2  Harry   19  91.54

student_df.head(2) : 

  Name  Age  Marks
0  Joe   20   85.1
1  Nat   21   77.8

student_df.tail(2) : 

    Name  Age  Marks
1    Nat   21  77.80
2  Harry   19  91.54

student_df.at[0, 'Name'] : 

Joe

student_df.iat[0, 0] : 

Joe

student_df.get('Name') : 

0      Joe
1      Nat
2    Harry
Name: Name, dtype: object

student_df.loc[0:2, ['Name']] : 

    Name
0    Joe
1    Nat
2  Harry

student_df.iloc[0:2, 0:2] : 

  Name  Age
0  Joe   20
1  Nat   21
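The table above also mentions filtering with `loc`, setting values with `at`, and the default value of `get`, none of which the example exercises. The following is a minimal sketch of those three uses; the threshold of 80, the replacement mark, and the `'Grade'` column name are arbitrary choices for illustration.

```python
import pandas as pd

student_df = pd.DataFrame({
    "Name": ["Joe", "Nat", "Harry"],
    "Age": [20, 21, 19],
    "Marks": [85.10, 77.80, 91.54],
})

# Filter rows with a boolean condition and keep two columns
high_scorers = student_df.loc[student_df["Marks"] > 80, ["Name", "Marks"]]
print(high_scorers)

# Set a single value by row label and column label (on a copy,
# so the original DataFrame stays unchanged)
updated_df = student_df.copy()
updated_df.at[1, "Marks"] = 80.0
print(updated_df)

# get() returns the default instead of raising when the column is missing
grade = student_df.get("Grade", default="no such column")
print(grade)

# Label slices include the end label; position slices exclude the end
print(len(student_df.loc[0:1]), len(student_df.iloc[0:1]))
```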
