## Loading Data

This notebook explains how to load data from CSV and Excel files in pandas, with examples. We start with CSV files.

### Loading data from a CSV file:

To load data from a CSV file, you can use the `read_csv()` function in pandas. This function reads the contents of a CSV file and returns a DataFrame object, which is a two-dimensional tabular data structure in pandas.

Here's the syntax of the `read_csv()` function:

```python
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, ...)
```

Now, let's go through the important parameters of the `read_csv()` function:

- `filepath_or_buffer`: This parameter specifies the CSV source to be loaded. It can be a string file path, a URL, or a file-like object with a `read()` method.
- `sep`: This parameter specifies the delimiter character that separates the values in the CSV file. By default, it is set to ',' (comma).
- `delimiter`: This parameter is an alias for `sep`. Both can be used interchangeably.
- `header`: This parameter specifies which row in the CSV file should be used as the header row. Row numbers are zero-based, so `header=0` specifies the first row. The default, 'infer', behaves like `header=0` when `names` is not given and like `header=None` (the file has no header row) when `names` is given. To use a different row as the header, pass its integer row number.
- `names`: This parameter allows you to specify custom column names for the DataFrame. You can pass a list of strings representing the column names.
- `index_col`: This parameter specifies which column should be used as the index of the DataFrame. You can pass the column name or column index as the value.
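
The parameters above can be tried without a file on disk. The following is a minimal sketch, assuming a small made-up semicolon-separated table held in memory: `io.StringIO` stands in for a file path, and the data and column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data; StringIO stands in for a file on disk.
raw = "ID;Name;Score\n1;Ann;9.5\n2;Bob;7.0\n"

# sep=';' sets the delimiter, header=0 takes column names from the first row,
# and index_col='ID' uses the ID column as the row index.
df = pd.read_csv(io.StringIO(raw), sep=';', header=0, index_col='ID')

# names= supplies custom column names; keeping header=0 makes pandas discard
# the original header row instead of reading it as data.
renamed = pd.read_csv(
    io.StringIO(raw), sep=';', header=0, names=['id', 'name', 'score']
)

print(df)
print(renamed)
```

Passing `names` without `header=0` would treat the first line of the file as a data row, which is rarely what is wanted when the file already has a header.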

Now, let's see an example of loading data from a CSV file:

```python
import pandas as pd

# Load data from a CSV file
data = pd.read_csv('path/to/data.csv', sep=';', header=0, index_col='ID')

# Display the loaded data
print(data.head())
```

In this example, we load the data from the 'data.csv' file. We specify the delimiter as ';' using the `sep` parameter. We set `header=0` to consider the first row as the header. We also specify that the 'ID' column should be used as the index of the DataFrame using the `index_col` parameter.

In [1]:
import pandas as pd

titanic = pd.read_csv('./data/titanic.csv')
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
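
One detail in the output above is that `Age` is `float64` while the other numeric columns are `int64`. A likely contributor is that `Age` has missing values, which `read_csv()` stores as `NaN` in a floating-point column by default. The sketch below illustrates this on a tiny made-up table (the rows are hypothetical, not taken from the Titanic file).

```python
import io

import pandas as pd

# Hypothetical data: the Age value is missing in the second row.
raw = "PassengerId,Age,Name\n1,22,Ann\n2,,Bob\n3,35,Cy\n"

df = pd.read_csv(io.StringIO(raw))

# PassengerId has no gaps and is read as integers; Age contains a gap,
# so pandas falls back to a float column that can hold NaN.
print(df.dtypes)
print(df['Age'].isna().sum())
```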

In [2]:
titanic[['Name', 'Survived']]

Unnamed: 0,Name,Survived
0,"Braund, Mr. Owen Harris",0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,"Heikkinen, Miss. Laina",1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,"Allen, Mr. William Henry",0
...,...,...
886,"Montvila, Rev. Juozas",0
887,"Graham, Miss. Margaret Edith",1
888,"Johnston, Miss. Catherine Helen ""Carrie""",0
889,"Behr, Mr. Karl Howell",1
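
The double brackets in the cell above matter: a list of column names returns a DataFrame, while a single name returns a Series. A minimal sketch on a made-up frame (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical two-row frame standing in for the Titanic data.
df = pd.DataFrame(
    {'Name': ['Ann', 'Bob'], 'Survived': [1, 0], 'Age': [30.0, 41.0]}
)

# A list of column names selects a sub-DataFrame with those columns.
subset = df[['Name', 'Survived']]

# A single column name selects one column as a Series.
names = df['Name']

print(type(subset).__name__)
print(type(names).__name__)
```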


In [5]:
survived = titanic[titanic['Survived'] == 1]
survived

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
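
The filter above works in two stages: the comparison builds a boolean Series, and indexing with that Series keeps the rows where it is `True`. The sketch below separates the two stages on a made-up frame (the rows are hypothetical). The surviving rows keep their original index labels, which is why the index in the output above has gaps.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Titanic data.
df = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cy'], 'Survived': [0, 1, 1]})

# The comparison yields a boolean Series with one True/False per row.
mask = df['Survived'] == 1

# Indexing with the mask keeps only the rows where it is True.
survived = df[mask]

print(mask.tolist())
print(survived)
```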


In [7]:
survived[survived['Sex'] == 'female']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
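
Filtering an already filtered frame, as above, can also be written as a single step by combining the conditions with `&`. This is a sketch on made-up rows, not the notebook's own approach. Each condition needs its own parentheses because `&` binds more tightly than `==`.

```python
import pandas as pd

# Hypothetical frame standing in for the Titanic data.
df = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'Sex': ['female', 'male', 'female', 'female'],
    'Survived': [1, 1, 0, 1],
})

# Two-step filtering, as in the cells above.
survived = df[df['Survived'] == 1]
two_step = survived[survived['Sex'] == 'female']

# One-step equivalent: each condition in parentheses, joined with &.
one_step = df[(df['Survived'] == 1) & (df['Sex'] == 'female')]

print(one_step)
```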


In [15]:
titanic['is_junior'] = titanic['Age'] < 15
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,is_junior
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,False
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,False
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,False
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,False
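
Note that row 888 above has no `Age` and still gets `is_junior == False`: any comparison with `NaN` evaluates to `False`, so passengers of unknown age are silently counted as non-juniors. A minimal sketch on made-up ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the last one is missing.
df = pd.DataFrame({'Age': [10.0, 40.0, np.nan]})

# NaN < 15 evaluates to False, so the missing age is flagged as not junior.
df['is_junior'] = df['Age'] < 15

# Counting the missing ages separately makes the gap visible.
missing_ages = int(df['Age'].isna().sum())

print(df)
print(missing_ages)
```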


### Loading data from an Excel file:

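First, note that pandas needs an extra package to read Excel files. For legacy `.xls` files, such as the `movies.xls` file used below, install `xlrd` using `pip`. Recent versions of `xlrd` read only `.xls`; for `.xlsx` files, install `openpyxl` the same way.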

```
pip install xlrd
```

To load data from an Excel file, you can use the `read_excel()` function in pandas. This function reads the contents of an Excel file and returns a DataFrame object.

Here's the syntax of the `read_excel()` function:

```python
pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, ...)
```

Let's go through the important parameters of the `read_excel()` function:

- `io`: This parameter specifies the path, URL, or ExcelFile object of the Excel file to be loaded.
- `sheet_name`: This parameter specifies the sheet from which the data should be loaded. By default, it is set to 0, which loads the first sheet. You can pass the sheet name (as a string) or its zero-based index (as an integer). Passing a list of sheets, or `None` for all sheets, returns a dictionary of DataFrames instead of a single DataFrame.
- `header`: This parameter specifies which row in the Excel sheet should be used as the header row. By default, it is set to 0, which means the first row is the header. To use a different row as the header, pass its zero-based row number.
- `names`: This parameter allows you to specify custom column names for the DataFrame. You can pass a list of strings representing the column names.
- `index_col`: This parameter specifies which column should be used as the index of the DataFrame. You can pass the column name or column index as the value.
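
The parameters above can be exercised without an existing workbook. The sketch below is an illustration under assumptions: it writes a small made-up table to an in-memory `.xlsx` workbook and reads it back, which needs the `openpyxl` package. The table contents are hypothetical, and the round trip is skipped if `openpyxl` is not installed.

```python
import io

import pandas as pd

# Hypothetical table used only for this illustration.
source = pd.DataFrame({'ID': [1, 2], 'Title': ['A', 'B']})

try:
    # Write an .xlsx workbook into memory (requires openpyxl).
    buffer = io.BytesIO()
    source.to_excel(buffer, sheet_name='Sheet1', index=False, engine='openpyxl')
    buffer.seek(0)

    # Read the named sheet back; the first row is the header and the
    # first column (position 0, here 'ID') becomes the index.
    loaded = pd.read_excel(buffer, sheet_name='Sheet1', header=0, index_col=0)
    print(loaded)
except ImportError:
    loaded = None
    print('openpyxl is not installed; skipping the Excel round trip')
```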

Now, let's see an example of loading data from an Excel file:

```python
import pandas as pd

# Load data from an Excel file
data = pd.read_excel('path/to/data.xlsx', sheet_name='Sheet1', header=0, index_col='ID')

# Display the loaded data
print(data.head())
```

In this example, we load the data from the 'data.xlsx' file. We specify the sheet name as 'Sheet1' using the `sheet_name` parameter. We set `header=0` to consider the first row as the header. We also specify that the 'ID' column should be used as the index of the DataFrame using the `index_col` parameter.

In [28]:
movies = pd.read_excel('./data/movies.xls')
movies

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,Intolerance: Love's Struggle Throughout the Ages,1916,Drama|History|War,,USA,Not Rated,123,1.33,385907.0,,...,436,22,9.0,481,691,1,10718,88,69.0,8.0
1,Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110,1.33,100000.0,3000000.0,...,2,2,0.0,4,0,1,5,1,1.0,4.8
2,The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151,1.33,245000.0,,...,81,12,6.0,108,226,0,4849,45,48.0,8.3
3,Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145,1.33,6000000.0,26435.0,...,136,23,18.0,203,12000,1,111841,413,260.0,8.3
4,Pandora's Box,1929,Crime|Drama|Romance,German,Germany,Not Rated,110,1.33,,9950.0,...,426,20,3.0,455,926,1,7431,84,71.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,Twin Falls Idaho,1999,Drama,English,USA,R,111,1.85,500000.0,985341.0,...,980,505,482.0,3166,180,0,3479,87,54.0,7.3
1334,Universal Soldier: The Return,1999,Action|Sci-Fi,English,USA,R,83,1.85,24000000.0,10431220.0,...,2000,577,485.0,4024,401,0,24216,162,75.0,4.1
1335,Varsity Blues,1999,Comedy|Drama|Romance|Sport,English,USA,R,106,1.85,16000000.0,52885587.0,...,23000,255,35.0,23369,0,0,35312,267,67.0,6.4
1336,Wild Wild West,1999,Action|Comedy|Sci-Fi|Western,English,USA,PG-13,106,1.85,170000000.0,113745408.0,...,10000,4000,582.0,15870,0,2,129601,648,85.0,4.8


In [30]:
movies[['Title', 'Genres']]

Unnamed: 0,Title,Genres
0,Intolerance: Love's Struggle Throughout the Ages,Drama|History|War
1,Over the Hill to the Poorhouse,Crime|Drama
2,The Big Parade,Drama|Romance|War
3,Metropolis,Drama|Sci-Fi
4,Pandora's Box,Crime|Drama|Romance
...,...,...
1333,Twin Falls Idaho,Drama
1334,Universal Soldier: The Return,Action|Sci-Fi
1335,Varsity Blues,Comedy|Drama|Romance|Sport
1336,Wild Wild West,Action|Comedy|Sci-Fi|Western
