
## 1. Pandas Library

``pandas`` is a library for data manipulation and analysis in the Python programming language. In particular, it offers data structures and operations for manipulating numerical tables and time series.

## 2. Import library
Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``:

In [1]:
import pandas as pd
pd.__version__

'1.5.3'

## 3. Major Data Structures

The most commonly used pandas data structures are:

<ul>
  <li>Series</li>
  <li>DataFrames</li>
</ul>

### 3.1 Pandas Series
Pandas ``Series`` is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar NumPy array:

In [None]:
print(data.values)

[0.25 0.5  0.75 1.  ]
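The companion ``index`` attribute can be inspected in the same way. The following is a minimal sketch, assuming the same four-element ``Series`` as above (it is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# With no explicit index, pandas supplies a default integer index (a RangeIndex).
print(data.index)

# Converting to a list shows the label attached to each value.
print(list(data.index))
```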


Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
data[1]

0.5

In [None]:
data[1:3]

1    0.50
2    0.75
dtype: float64

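As we will see, the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy array that it emulates. For example, the index does not have to be an integer sequence; it can be defined explicitly, here with strings: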

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [None]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [None]:
data['b']

0.5

In [None]:
# We can even use non-sequential integer indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [None]:
data[5]

0.5
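Note that ``data[5]`` above looks up the *label* 5, not the sixth position (the ``Series`` has only four entries). When the index consists of integers, this can be confusing, so it may help to state the intent explicitly with ``loc`` (label-based) and ``iloc`` (position-based). A minimal sketch, reusing the same values as above:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[5]       # the entry whose index label is 5
by_position = data.iloc[1]   # the second entry, whatever its label is
last_entry = data.iloc[3]    # the fourth entry, which carries the label 7

print(by_label, by_position, last_entry)
```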

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Florida': 9000,  # duplicate key: this later value overwrites the one above
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida           9000
Illinois      12882135
dtype: int64

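By default, a ``Series`` created from a dictionary takes its index from the dictionary keys. In the pandas version used here, the keys keep their insertion order, as the output shows (older pandas versions sorted them).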

In [None]:
population['California']

38332521

The ``Series`` also supports array-style operations such as slicing:

In [None]:
population['California':'Illinois']
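One detail worth knowing when slicing: a slice by label includes its end point, while a slice by position excludes it, as with ordinary Python lists. The sketch below illustrates the difference on a small, hypothetical ``Series`` named ``s``, using ``loc`` and ``iloc`` to make the two kinds of slicing explicit:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

label_slice = s.loc['a':'c']     # label-based: the end label 'c' is included
position_slice = s.iloc[0:2]     # position-based: position 2 is excluded

print(label_slice)
print(position_slice)
```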

### Constructing Series objects

We've already seen a few ways of constructing a Pandas ``Series`` from scratch; all of them are some version of the following:

```python
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:
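A minimal sketch of this case follows; the values are arbitrary and chosen only for illustration:

```python
import numpy as np
import pandas as pd

from_list = pd.Series([2, 4, 6])
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# In both cases the index defaults to the integers 0, 1, 2.
print(from_list)
print(from_array)
```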

``data`` can be a scalar, which is repeated to fill the specified index:

In [None]:
print(pd.Series(5, index=[100, 200, 300]))

100    5
200    5
300    5
dtype: int64


``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys, kept in insertion order (older pandas versions sorted them):

In [None]:
print(pd.Series({2:'a', 1:'b', 3:'c'}))

2    a
1    b
3    c
dtype: object


In each case, the index can be explicitly set if a different result is preferred:

In [None]:
print(pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2]))

3    c
2    a
dtype: object


Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.
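The reverse case may also be useful to see: if the explicit index contains a label that is not a key of the dictionary, pandas keeps the label and marks its value as missing. A short sketch, where the label ``4`` is an illustrative addition that is not in the dictionary:

```python
import pandas as pd

s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])

# The label 4 has no matching key, so its value is missing (NaN).
print(s)
print(s.isna())
```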

### 3.2 The Pandas DataFrame Object

The next fundamental structure in Pandas is the ``DataFrame``.
If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.
You can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects. Here, by "aligned" we mean that they share the same index.

To demonstrate this, let's first construct a new ``Series`` listing the area of each of the five states discussed in the previous section:

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [None]:
states = pd.DataFrame({'population': population,'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida           9000  170312
Illinois      12882135  149995
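Here the two ``Series`` happen to share exactly the same index. When they do not, the ``DataFrame`` constructor aligns them on the union of their labels and fills the gaps with ``NaN``. The sketch below uses two deliberately incomplete, hypothetical ``Series`` (``pop_partial`` and ``area_partial``) to show this:

```python
import pandas as pd

pop_partial = pd.Series({'California': 38332521, 'Texas': 26448193})
area_partial = pd.Series({'Texas': 695662, 'New York': 141297})

# Rows are matched by label, not by position.
combined = pd.DataFrame({'population': pop_partial, 'area': area_partial})
print(combined)
```

Only ``Texas`` appears in both inputs, so it is the only row with both values present.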


Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [None]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [None]:
states.columns

Index(['population', 'area'], dtype='object')

Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

In [None]:
type(states['area'])

pandas.core.series.Series

In [None]:
print(states.iloc[4])

print(states.iloc[4]['area'])

population    12882135
area            149995
Name: Illinois, dtype: int64
149995
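The same cell can also be reached by label rather than by position, and whole columns can be combined arithmetically. A minimal sketch, assuming a two-row stand-in for the table above (``states_demo`` and ``density`` are illustrative names, not part of the original notebook):

```python
import pandas as pd

states_demo = pd.DataFrame(
    {'population': [38332521, 12882135], 'area': [423967, 149995]},
    index=['California', 'Illinois'],
)

# loc takes a row label and a column label.
illinois_area = states_demo.loc['Illinois', 'area']

# Column arithmetic is applied row by row and returns a Series.
density = states_demo['population'] / states_demo['area']

print(illinois_area)
print(density)
```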





A ``DataFrame`` can also be built from a list of dictionaries, one per row. Even if some keys are missing from some of the dictionaries, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
# pd.DataFrame([{'a': 1, 'b': 2, 'c': 4}, {'a':4,'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
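A side effect visible in this output is that columns ``a`` and ``c`` are shown as floats (``1.0``, ``4.0``), while ``b`` stays an integer: ``NaN`` is a floating-point value, so a column that contains it is stored as ``float64``. The sketch below checks this and shows two common ways to handle the gaps; the fill value ``0`` is an arbitrary choice for illustration:

```python
import pandas as pd

frame = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with missing entries become float64; the complete column stays int64.
print(frame.dtypes)

# Count the missing entries per column.
missing_per_column = frame.isna().sum()
print(missing_per_column)

# Replace the missing entries with a chosen value.
filled = frame.fillna(0)
print(filled)
```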


### 4.1 Mount Drive (needed only on Colab; not required when running locally)


In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
states.to_csv('states.csv')

In [None]:
df = pd.read_csv('states.csv')
df

   Unnamed: 0  population    area
0  California    38332521  423967
1       Texas    26448193  695662
2    New York    19651127  141297
3     Florida        9000  170312
4    Illinois    12882135  149995
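The ``Unnamed: 0`` column appears because ``to_csv`` writes the index as a first column without a header, and ``read_csv`` then reads it back as ordinary data. Passing ``index_col=0`` tells ``read_csv`` to use that column as the row labels again. The sketch below demonstrates the round trip in memory, assuming a two-row stand-in table (``states_demo``) and using ``io.StringIO`` instead of a file on disk:

```python
import io

import pandas as pd

states_demo = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

# With no path given, to_csv returns the CSV content as a string.
csv_text = states_demo.to_csv()

# index_col=0 restores the first column as the row index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```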


In [None]:
df_2 = pd.read_csv('states.csv',header=0,usecols=["population", "area"])
df_2

   population    area
0    38332521  423967
1    26448193  695662
2    19651127  141297
3        9000  170312
4    12882135  149995



`pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None, usecols=None, engine=None, skiprows=None, nrows=None)`

Parameters:

*   `filepath_or_buffer`: The location of the file to read. It accepts a string path, a URL, or a file-like object.
*   `sep`: The field separator. The default is `','`, as in CSV (comma-separated values).
*   `header`: The row number (or list of row numbers) to use as the column names; the data starts on the row that follows. With `header=None`, no row is treated as a header and the columns are labeled 0, 1, 2, and so on.
*   `usecols`: Reads only the selected columns from the CSV file.
*   `nrows`: The number of rows to read from the file.
*   `index_col`: The column(s) to use as the row labels. If `None`, pandas assigns a default integer index (0, 1, 2, ...).
*   `skiprows`: The line numbers to skip (0-indexed), or the number of lines to skip at the start of the file.
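Several of these parameters can be tried out without touching the file system. The following sketch assumes a small CSV held in a string (``csv_text``, with a hypothetical ``state`` column) and reads it through ``io.StringIO``:

```python
import io

import pandas as pd

csv_text = """state,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Illinois,12882135,149995
"""

# usecols keeps only the named columns; nrows stops after two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), usecols=['state', 'area'], nrows=2)

# index_col uses the 'state' column as the row labels.
by_state = pd.read_csv(io.StringIO(csv_text), index_col='state')

# skiprows=1 drops the header line; header=None then labels the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=1)

print(first_two)
print(by_state)
print(no_header)
```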

## References
https://github.com/jakevdp/PythonDataScienceHandbook