## 298. Introduction
- NumPy is a great library for working with homogeneous numeric data using integer-based indexing, but it is not well suited to much of today's Big Data
- Big Data needs data structures that can be easily customized
- Big Data comes in mixed types and often has missing values that must be handled
- We also need various functions and mathematical operations that can be applied to Big Data; that is where `Pandas` comes in
- The name `Pandas` is derived from `Panel Data`
- Examples of such data are stock prices, players' scores across matches, students' grades across exams, and so on
        NumPy                   Pandas
        Numeric                 Custom
        Integer Indexing        Mixed/Missing
                                Manipulation
- Pandas provides two main data structures
    1. `Series`
        - to handle one-dimensional data
    2. `DataFrame`
        - to handle two-dimensional data
- Pandas uses arrays behind the scenes, so it is very closely related to the NumPy library
- Several NumPy functions accept `Series` and `DataFrame` objects as arguments, so you can use Pandas together with NumPy as well
- Both `Series` and `DataFrame` allow us to easily select and manipulate the data
- We can apply element-wise functions with methods such as `map()` and aggregate the data with reductions such as `sum()`, right out of the box
- We can perform various mathematical operations on Big Data
- We can also visualize the data in different formats
- All of this is built into `Pandas`
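
To illustrate the point about NumPy functions accepting Pandas objects, here is a minimal sketch. The ticker labels and prices are made-up sample data, not taken from the course material.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: prices for three made-up tickers.
prices = pd.Series([100.0, 400.0, 900.0], index=["AAA", "BBB", "CCC"])

# A NumPy universal function applied to a Series returns a Series
# and keeps the custom index labels.
roots = np.sqrt(prices)
print(roots)

# NumPy reductions also accept a Series directly.
total = np.sum(prices)
print("total:", total)
```

The result of `np.sqrt` is still a `Series`, so the labels stay attached to the values after the NumPy call.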

## 299. Series
- A `Series` is an enhanced one-dimensional array
- While arrays use numeric `zero-based indexing`, a `Series` also supports `custom indexing`, for example with strings
- A `Series` also handles missing data: most of its statistical methods skip missing values (`NaN`) by default, whereas plain `NumPy` functions such as `numpy.mean()` propagate them
- We can create a `Series` from a `list`, a `numpy.ndarray`, a `dict`, etc.
- The default index of a `Series` is numeric and starts from zero, but we can customize it
- We're going to create several Series of our own and explore different functions on `Series` like `count()`, `mean()`, `min()`, `max()`, `std()`, `describe()`, and more
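
As a quick sketch of the creation options listed above, the snippet below builds a `Series` from a list, a NumPy array, and a dict, and once more with a custom index. All variable names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a list: the default index is 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])

# From a NumPy array: the dtype is taken from the array.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a dict: the keys become the index labels.
from_dict = pd.Series({"math": 90, "physics": 85})

# From a list with a custom (string) index.
custom = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(from_list)
print(from_array)
print(from_dict)
print(custom)
```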

## 300. Create Project
- To install `Pandas` from the command line, run
``` shell
pip3 install pandas
```
- This installs `Pandas` into your Python environment

In [None]:
# pandas

## 301. Create and use Series
- We'll start exploring Pandas Series
- `pandas.Series(data=None, index=None, dtype=None, name=None, copy=None)`
    - One-dimensional ndarray with axis-labels (including time series)
    - `data` : array-like, Iterable, dict, or scalar value
        - Contains data stored in Series. If data is a dict, argument order is maintained.
    - `index` : array-like or Index (1d)
        - Values must be hashable and have the same length as `data`.
        - Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, ..., n) if not provided.
        - If data is dict-like and index is None, then the keys in the data are used as the index.
        - If the index is not None, the resulting Series is reindexed with the index values.
    - `dtype` : str, numpy.dtype, or ExtensionDtype, optional
        - Data type for the output Series. If not specified, this will be inferred from `data`.
    - `name` : Hashable, default None
        - The name to give to the Series.
    - `copy` : bool, default False
        - Copy input data. Only affects Series or 1d ndarray input
- `pandas.Series.count()`
    - Returns the number of non-NA/null observations in the Series
- `pandas.Series.mean(axis=0, skipna=True, numeric_only=False, **kwargs)`
    - Returns the mean of the values over the requested axis
- `pandas.Series.min(axis=0, skipna=True, numeric_only=False, **kwargs)`
    - Returns the minimum of the values over the requested axis
- `pandas.Series.max(axis=0, skipna=True, numeric_only=False, **kwargs)`
    - Returns the maximum of the values over the requested axis
- `pandas.Series.std(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)`
    - Returns the sample standard deviation over the requested axis
    - Normalized by N-1 by default. This can be changed using the `ddof` argument.


In [None]:
# pandas
# series_demo.py
import pandas as pd

reviews = pd.Series([4.6, 4.4, 4.8, 5])
print(reviews) # 1st column is the index (starting from 0), 2nd column is the data
print("reviews[0]:", reviews[0]) # accessing Series element using index

print("reviews.count():", reviews.count()) # count of non-null elements in Series
print("reviews.mean():", reviews.mean()) # mean of non-null elements in Series
print("reviews.min():", reviews.min()) # min of non-null elements in Series
print("reviews.max():", reviews.max()) # max of non-null elements in Series
print("reviews.std():", reviews.std()) # sample standard deviation of non-null elements in Series

0    4.6
1    4.4
2    4.8
3    5.0
dtype: float64
reviews[0]: 4.6
reviews.count(): 4
reviews.mean(): 4.7
reviews.min(): 4.4
reviews.max(): 5.0
reviews.std(): 0.25819888974716104
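
The `skipna` and `ddof` parameters described above can be seen in a small sketch. The `ratings` values, including the deliberately missing entry, are assumed sample data.

```python
import numpy as np
import pandas as pd

# Assumed sample data with one missing value.
ratings = pd.Series([4.6, np.nan, 4.8, 5.0])

n_valid = ratings.count()                 # counts only non-null values
mean_skip = ratings.mean()                # skipna=True by default
mean_strict = ratings.mean(skipna=False)  # NaN propagates to the result

sample_std = ratings.std()                # ddof=1: divide by N-1
population_std = ratings.std(ddof=0)      # ddof=0: divide by N

print("count:", n_valid)
print("mean (skipna=True):", mean_skip)
print("mean (skipna=False):", mean_strict)
print("sample std:", sample_std)
print("population std:", population_std)
```

With `ddof=0` the divisor is N instead of N-1, so the population standard deviation is always slightly smaller than the sample one for the same data.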


## 302. Use Custom indices
- Previously, you've seen that a Pandas Series generates a default index that starts at 0 and goes up to length-1
- Instead of the default index, we can use a custom index as well
- Instead of defining a Series from a list and then passing the index, you can also initialize a Series from a dict, where the keys become the index labels and the values become the values of the Series
- `pandas.Series.values`
    - returns the values of the Series as an ndarray
- `pandas.Series.index`
    - returns the immutable sequence used for indexing and alignment

In [None]:
reviews = pd.Series([4.6, 4.4, 4.8, 5], index=['python', 'java', 'django', 'devops'])
print(reviews)

reviews = pd.Series({'python': 4.6, 'java':4.4, 'django':4.8, 'devops':5}) # keys will become the indices
print(reviews)

print("reviews['python']:", reviews['python']) # access Series elements using custom index
print("reviews.python:", reviews.python) # access Series element using dot operator
print("reviews.java:", reviews.java)
print("reviews.django:", reviews.django)

print(reviews.values) # returns an ndarray with all the values only of Series
print(reviews.index) # returns an immutable sequence used for indexing & alignment

python    4.6
java      4.4
django    4.8
devops    5.0
dtype: float64
python    4.6
java      4.4
django    4.8
devops    5.0
dtype: float64
reviews['python']: 4.6
reviews.python: 4.6
reviews.java: 4.4
reviews.django: 4.8
[4.6 4.4 4.8 5. ]
Index(['python', 'java', 'django', 'devops'], dtype='object')
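
A dict and an explicit `index` can also be combined. As noted in the `pandas.Series` parameter description, the Series is then reindexed with the given labels. The sketch below assumes a made-up `golang` label that is absent from the dict, to show what happens to labels without a value.

```python
import pandas as pd

scores = {"python": 4.6, "java": 4.4, "django": 4.8}

# Only the labels listed in `index` are kept, in that order.
# "golang" (a hypothetical label) is not a key in the dict, so it becomes NaN.
selected = pd.Series(scores, index=["java", "python", "golang"])
print(selected)

# Membership tests on a Series check the index labels, not the values.
print("python" in selected)
print("django" in selected)
```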


## 303. Series of String
- You'll learn how to use string data within a Pandas Series
- In the output below, strings are stored with the `object` dtype
- `pandas.Series.str.upper()`
    - Converts the strings in the Series/Index to uppercase.
    - Equivalent to Python's `str.upper()`.
    - returns a Series or Index of object
    - returns NaN for non-string elements
- `pandas.Series.str.contains(pat, case=True, flags=0, na=None, regex=True)`
    - Tests whether a pattern or regex is contained within each string of a Series or Index.
    - returns a boolean Series or Index
    - each element is True if the pattern is present in that string, and False if it is not

In [None]:
courses = pd.Series(['Java', 'Python', 'AWS'])
print(courses)
print(courses.str.upper())
print(courses.str.contains('y'))

0      Java
1    Python
2       AWS
dtype: object
0      JAVA
1    PYTHON
2       AWS
dtype: object
0    False
1     True
2    False
dtype: bool
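
The `case`, `na`, and `regex` parameters of `str.contains()` listed above change how matching behaves. The following is a minimal sketch with assumed sample data, including one missing entry and a pattern containing regex metacharacters.

```python
import pandas as pd

# Assumed sample data with one missing entry.
courses = pd.Series(["Java", "Python", "AWS", None])

# case=False ignores letter case; na=False reports missing entries as False.
has_a = courses.str.contains("a", case=False, na=False)
print(has_a)

# regex=False treats the pattern as a literal string, which matters for
# patterns containing regex metacharacters such as "+".
languages = pd.Series(["C++", "C#"])
is_cpp = languages.str.contains("C++", regex=False)
print(is_cpp)
```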


## 304. Describe
- We'll introduce the `describe()` method on a Series
- `pandas.Series.describe(percentiles=None, include=None, exclude=None)`
    - returns descriptive statistics for the given Series
    - it ignores `NaN` values
    - it also gives percentiles
    - `Percentile`
        - the percentile rank of a value tells us the percentage of values in a dataset that rank equal to or below a given value
        - `25th Percentile` :
            - also known as the first, or lower, quartile
            - The 25th percentile is the value at which 25% of the values lie below it and 75% of the values lie above it

In [3]:
reviews = pd.Series([4.6, 4.4, 4.8, 5])
print(reviews)
print("reviews.describe():\n", reviews.describe())
# only the value 4.4 lies below the 25th percentile, which is 4.55
# half of the data lies below and half above the 50th percentile (the median), which is 4.70

0    4.6
1    4.4
2    4.8
3    5.0
dtype: float64
reviews.describe():
 count    4.000000
mean     4.700000
std      0.258199
min      4.400000
25%      4.550000
50%      4.700000
75%      4.850000
max      5.000000
dtype: float64
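
The `percentiles` argument of `describe()` and the related `quantile()` method can be sketched as follows, reusing the same `reviews` data. The choice of the 10th and 90th percentiles is only an illustration.

```python
import pandas as pd

reviews = pd.Series([4.6, 4.4, 4.8, 5])

# Ask describe() for the 10th and 90th percentiles instead of the defaults.
summary = reviews.describe(percentiles=[0.1, 0.9])
print(summary)

# quantile() returns a single percentile; 0.25 is the first quartile.
q25 = reviews.quantile(0.25)
print("25th percentile:", q25)
```

Pandas interpolates linearly between neighbouring values by default, which is why the 25th percentile (4.55) is not itself one of the four data points.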


## 305. DataFrame
- A DataFrame is an improved two-dimensional array
- They allow custom row and column indexing
- They have various operations required for data science projects
- Each column in a DataFrame is a Series
- This is a DataFrame of Cricket Players across matches
        a   Kohli   Rohit   Surya   Jadeja
        I1  100     100     77      99
        I2  50      88      110     120
        I3  70      0       0       8
- Each row holds the scores of all players in one match, and each column holds one player's scores across matches
- DataFrames also handle missing data, just like Series
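
As a rough sketch of the table above, the scores can be loaded into a `DataFrame` from a dict of columns. The variable names are assumptions, and the row labels `I1` to `I3` are copied from the table.

```python
import pandas as pd

# Player scores from the table above: one column per player, one row per match.
scores = pd.DataFrame(
    {
        "Kohli": [100, 50, 70],
        "Rohit": [100, 88, 0],
        "Surya": [77, 110, 0],
        "Jadeja": [99, 120, 8],
    },
    index=["I1", "I2", "I3"],
)
print(scores)

# Selecting a single column returns a Series indexed by the row labels.
kohli = scores["Kohli"]
print(type(kohli).__name__)

# .loc selects a row by its label; the result is also a Series.
print(scores.loc["I2"])
```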

## 306. Create DataFrame
-