___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" 
alt="CLRSWY"></p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:120%; text-align:center; border-radius:10px 10px;">Way to Reinvent Yourself</p>

<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">Working with Text & Time Data</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [WORKING WITH TEXT DATA](#0)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#00)
* [WORKING WITH TEXT DATA](#1)
    * [String Methods](#1.1)
    * [Most Useful String Methods](#1.2)
    * [Dummy Operations](#1.3)
* [WORKING WITH TIME DATA](#2)
    * [pd.to_datetime()](#2.1)
    * [Series.dt](#2.2)
    * [Datetime Module](#2.3)
* [OPERATION WITH DATETIME OBJECT](#3)
* [THE END OF THE SESSION](#4)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="00"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Working with Text Data</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

In this notebook, we will first discuss the string operations with our basic Series/Index and learn how to apply these string functions on the DataFrame.

Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values. Almost all of these methods mirror Python's built-in string methods ([Refer To Official Python Documentation](https://docs.python.org/3/library/stdtypes.html#string-methods)). To apply them to a Series or Index, access them through the ``.str`` accessor (for example, ``ser.str.lower()``), which performs the operation on each element.

In addition, according to [Pandas Official Document](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), there are two ways to store text data in pandas:
- ``object``-dtype NumPy array.
- ``StringDtype`` extension type.

Pandas recommends using ``StringDtype`` to store text data.

[SOURCE01](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), [SOURCE02](https://www.w3schools.com/python/python_ref_string.asp)
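The two storage options can be compared in a minimal sketch. The sample values below are made up for illustration and are not part of the exercise data.

```python
import pandas as pd

# Explicit object dtype: elements are stored as generic Python objects
s_object = pd.Series(["pandas", "numpy", None], dtype="object")

# StringDtype: a dedicated extension type for text
s_string = pd.Series(["pandas", "numpy", None], dtype="string")

print(s_object.dtype)
print(s_string.dtype)

# String methods skip missing values instead of raising an error
upper = s_string.str.upper()
print(upper)
```

The missing entry stays missing after ``str.upper()``, which is the "ignore NaN" behavior described above.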

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">String Methods</p>

<a id="1.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Strings implement all of the common sequence operations, along with the additional methods described at [the official documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).

Strings also support two styles of string formatting, one providing a large degree of flexibility and customization (**Please see the information about** [str.format()](https://docs.python.org/3/library/stdtypes.html#str.format), [Format String Syntax](https://docs.python.org/3/library/string.html#formatstrings) and [Custom String Formatting](https://docs.python.org/3/library/string.html#string-formatting)) and the other based on C printf style formatting that handles a narrower range of types and is slightly harder to use correctly, but is often faster for the cases it can handle ([printf-style String Formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)).

The [Text Processing Services](https://docs.python.org/3/library/text.html#textservices) section of the standard library covers a number of other modules that provide various text related utilities (including regular expression support in the [re](https://docs.python.org/3/library/re.html#module-re) module).

Please watch [**``Video Source``**](https://www.youtube.com/watch?v=6JNwK6hEneg) for enhancing your understanding of working with Text Data in Pandas.  

**What are these String Methods? Now let us examine some of the most common and useful String Methods and dig into them one by one:**

In [2]:
# Reading .xlsx files needs the optional openpyxl package (pip install openpyxl)
df = pd.read_excel("text_exercise.xlsx")
df

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Most Useful String Methods</p>

<a id="1.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- **str.lower() =>** Converts a string into lower case
- **str.upper() =>** Converts a string into upper case
- **str.capitalize() =>** Converts the first character to upper case and the remaining characters to lower case
- **str.title() =>** Converts the first character of each word to upper case
- **str.swapcase() =>** Swaps the case lower/upper

[SOURCE01](https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm)
[SOURCE02](https://www.aboutdatablog.com/post/10-most-useful-string-functions-in-pandas)
[SOURCE03](https://towardsdatascience.com/5-must-know-pandas-operations-on-strings-4f88ca6b8e25)
[SOURCE04](https://towardsdatascience.com/pandas-string-operations-explained-fdfab7602fb4)
[SOURCE05](https://blog.devgenius.io/string-operations-on-pandas-dataframe-88af220439d1)
[SOURCE06](https://www.geeksforgeeks.org/string-manipulations-in-pandas-dataframe/)
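As a small illustration of the case-conversion methods listed above, the sketch below applies each of them to a made-up Series (the values are assumptions, not taken from ``text_exercise.xlsx``).

```python
import pandas as pd

names = pd.Series(["data SCIENCE", "machine learning", None])

lowered = names.str.lower()
uppered = names.str.upper()
capitalized = names.str.capitalize()   # first character upper, rest lower
titled = names.str.title()             # first character of each word upper
swapped = names.str.swapcase()         # lower <-> upper

# Show all variants side by side; the missing value stays missing
print(pd.DataFrame({
    "lower": lowered,
    "upper": uppered,
    "capitalize": capitalized,
    "title": titled,
    "swapcase": swapped,
}))
```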

___

- **str.isalpha() =>** Returns True if all characters in the string are in the alphabet
- **str.isnumeric() =>** Returns True if all characters in the string are numeric
- **str.isalnum() =>** Returns True if all characters in the string are alphanumeric
- **str.endswith() =>** Returns True if the string ends with the specified value
- **str.startswith() =>** Returns True if the string starts with the specified value
- **str.contains() =>** Returns True for each element that contains the given substring or pattern, else False

[SOURCE01](https://careerkarma.com/blog/python-isalpha-isnumeric-isalnum/)
[SOURCE02](https://careerkarma.com/blog/python-startswith-and-endswith/)
[SOURCE03](https://www.geeksforgeeks.org/python-startswith-endswidth-function/)
[SOURCE04](https://towardsdatascience.com/check-for-a-substring-in-a-pandas-dataframe-column-4b949f64852#:~:text=The%20contains%20method%20in%20Pandas,str.)

___

**isalpha()** checks whether each string consists of alphabetic characters only. It returns True when all characters are alphabetic and the string is not empty; otherwise it returns False.

**isnumeric()** checks whether all characters in each string are numeric. This is equivalent to running the Python string method str.isnumeric() for each element of the Series/Index.

**isalnum()** checks whether each string consists of alphanumeric characters only. It returns True when every character is alphanumeric and the string is not empty; otherwise it returns False. Alphanumeric means a character that is either a letter or a number.

Pandas **startswith()** tests if the start of each string element matches a pattern. It is yet another method to search and filter text data in a Series or DataFrame. This method is similar to Python's startswith() method, but it has different parameters and works on pandas objects only. Hence ``.str`` has to be prefixed every time before calling this method, so that pandas applies it to each element instead of calling the built-in string method.

Pandas **endswith()** tests whether the end of each string element matches a pattern.

The **contains()** method in Pandas allows you to search a column for a specific substring. It returns a boolean Series that is True where the original value contains the substring and False where it does not. By default the pattern is interpreted as a regular expression [SOURCE](https://towardsdatascience.com/check-for-a-substring-in-a-pandas-dataframe-column-4b949f64852#:~:text=The%20contains%20method%20in%20Pandas,str.).

We can use these string methods, which return boolean values, to build conditions and select the matching rows.
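A minimal sketch of row selection with boolean string methods follows. The ``staff`` table and its column names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data for illustration
staff = pd.DataFrame({
    "name": ["Ann", "Bob", "Cleo", "Dan"],
    "job": ["Data Scientist", "data analyst", "Web Developer", None],
    "code": ["A12", "123", "abc", "x-9"],
})

# na=False treats missing values as "no match", so the mask is purely boolean
is_data_job = staff["job"].str.contains("data", case=False, na=False)
data_staff = staff[is_data_job]

# Prefix test on the same column
starts_with_web = staff["job"].str.startswith("Web", na=False)

# Character-class tests on the code column
numeric_codes = staff[staff["code"].str.isnumeric()]
alnum_codes = staff[staff["code"].str.isalnum()]

print(data_staff)
```

Without ``na=False`` the mask would contain a missing value for Dan's row, and boolean indexing with such a mask can raise an error.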

___

- **str.strip() =>** Returns a trimmed version of the string

- **str.replace() =>** Returns a string where a specified value is replaced with another specified value

- **str.split() =>** Splits the string at the specified separator, and returns a list

- **str.find() =>** Searches the string for a specified value and returns the lowest position where it was found (-1 if it was not found)

- **str.findall() =>** Returns a list of all occurrences of the pattern

- **str.join() =>** Joins the elements of an iterable into a single string, using the given separator

___

**NOTE:** To use and understand ``strip`` better, please review escape characters in Python: [Source01 for Escape Characters](https://www.python-ds.com/python-3-escape-sequences) & [Source02 for Escape Characters](https://www.w3schools.com/python/gloss_python_escape_characters.asp)
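The connection between ``strip`` and escape characters can be seen in a short sketch; the strings below are invented so that they contain spaces, a tab (``\t``) and a newline (``\n``).

```python
import pandas as pd

raw = pd.Series(["  Python\n", "\tSQL  ", "**Tableau**"])

# With no argument, strip removes leading and trailing whitespace,
# which includes spaces, tabs (\t) and newlines (\n)
stripped = raw.str.strip()

# With an argument, strip removes those characters from both ends instead
no_stars = stripped.str.strip("*")

# lstrip / rstrip only work on one side
left_only = raw.str.lstrip()

print(no_stars.tolist())
```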

### ``str.replace()`` vs ``.replace()``

- **Purpose:** Use **str.replace** for substring replacements on a single string column, and **replace** for any general replacement on one or more columns.

- **Usage:** **str.replace** can replace one thing at a time. **replace** lets you perform multiple independent replacements, i.e., replace many things at once.

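- **Default behavior:** In pandas 2.0 and later, **str.replace** treats the pattern as a literal string by default (``regex=False``); earlier versions enabled regex replacement by default. **replace** only performs a full match unless the regex=True switch is used.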
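The three differences can be illustrated with a small sketch. The ``prices`` table is hypothetical, and ``regex`` is passed explicitly so the result does not depend on the installed pandas version.

```python
import pandas as pd

# Hypothetical data for illustration
prices = pd.DataFrame({
    "price": ["$1,200", "$350", "n/a"],
    "status": ["ok", "n/a", "ok"],
})

# str.replace: substring replacement on one string column, one pattern per call
clean_price = (
    prices["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# replace: whole-value replacement, several mappings at once, across all columns
replaced = prices.replace({"n/a": "missing", "ok": "valid"})

# With regex=True, replace also matches substrings
no_dollar = prices["price"].replace(r"\$", "", regex=True)

print(clean_price.tolist())
print(replaced)
```

Note that ``replaced`` leaves ``"$1,200"`` untouched, because ``replace`` without ``regex=True`` only matches complete cell values.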

**Indexing with .str[]** 

You can use [] notation to directly index by position locations [SOURCE](https://pandas.pydata.org/pandas-docs/version/0.15/text.html). 

In [None]:
df.job
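A minimal sketch of positional indexing with ``.str[]`` is given below. It uses a made-up ``jobs`` Series rather than the ``df.job`` column, whose contents are not shown here.

```python
import pandas as pd

jobs = pd.Series(["Data Scientist", "Web Developer", None])

first_char = jobs.str[0]     # single position
first_four = jobs.str[:4]    # slice

# .str[] also indexes into lists, e.g. the result of str.split()
last_word = jobs.str.split().str[-1]

print(last_word.tolist())
```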

**str.find** returns, for each string in the Series/Index, the lowest index at which the substring is fully contained within [start:end]. It returns -1 on failure. Equivalent to standard str.find().

**str.rfind** returns, for each string in the Series/Index, the highest index at which the substring is fully contained within [start:end]. It returns -1 on failure. Equivalent to standard str.rfind().

**str.findall** finds all occurrences of a pattern or regular expression in each string of the Series/Index [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.findall.html).
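The difference between the three methods can be sketched on a small, invented Series:

```python
import pandas as pd

text = pd.Series(["banana", "ananas", "kiwi"])

first_pos = text.str.find("an")     # lowest index of "an", -1 if absent
last_pos = text.str.rfind("an")     # highest index of "an", -1 if absent
all_hits = text.str.findall("an")   # list of every match per element

# findall accepts regular expressions, here a character class of vowels
vowels = text.str.findall(r"[aeiou]")

print(first_pos.tolist(), last_pos.tolist())
print(all_hits.tolist())
```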

In [None]:
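# "skills": every row holds a list of skill names (possibly empty)
df["skills"] = [[],["Java","C++"],["Python","Tableau","SQL"],[],["React","Django"],["JavaScript","Python"],["R","SQL"],["SQL","Python"]]
# "Skills": same data, except row 1, which is overwritten below with a plain string
df["Skills"] = [[],[],["Python","Tableau","SQL"],[],["React","Django"],["JavaScript","Python"],["R","SQL"],["SQL","Python"]]
df.loc[1, "Skills"] = "Java,C++"
df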

If the elements of a Series are lists themselves, **str.join** joins the content of each list using the delimiter passed to the function. This function is equivalent to Python's str.join() [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.join.html).
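A minimal sketch of joining list elements follows, using a shortened, made-up version of the skills lists.

```python
import pandas as pd

skills = pd.Series([["Java", "C++"], ["Python", "Tableau", "SQL"], []])

# Each list becomes one string; an empty list becomes an empty string
joined = skills.str.join(", ")

# str.split reverses the operation for the non-empty strings
split_back = joined.str.split(", ")

print(joined.tolist())
```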

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Dummy Operations</p>

<a id="1.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

A dataset may contain various types of values, and sometimes it contains categorical values. In order to use those categorical values efficiently in programming, we create dummy variables. A dummy variable is a binary variable that indicates whether a separate categorical variable takes on a specific value [SOURCE](https://www.geeksforgeeks.org/how-to-create-dummy-variables-in-python-with-pandas/).

### get_dummies()

**Syntax1:** ``pd.get_dummies(data, prefix=None, prefix_sep="_")``<br>
            **OR**<br>
**Syntax2:** ``df["col_name"].str.get_dummies(sep=",")``

**Parameters:**
- data = input data, e.g. an array-like, a Series or a DataFrame.
- prefix = string to prepend to the dummy column names.
- prefix_sep = separator placed between the prefix and the category value (default ``"_"``).
- Return Type: a DataFrame of dummy-coded variables.

In [4]:
ser = pd.Series(["p|q","p","p|r"])
ser

0    p|q
1      p
2    p|r
dtype: object

In [5]:
ser.str.get_dummies()

|   | p | q | r |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 |


As you can see, three dummy variables (``p``, ``q`` and ``r``) are created, one for each distinct value that appears once the strings are split on the default separator ``|``. We can create dummy variables in python using the **``get_dummies()``** method.

Passing **``drop_first=True``** to ``pd.get_dummies()`` drops the first dummy column. For a categorical variable with k levels, k-1 dummies carry the full information, because the dropped level is implied whenever all remaining dummies are 0. Dropping it therefore removes a redundant column and avoids creating perfectly correlated features [SOURCE](https://stackoverflow.com/questions/63661560/drop-first-true-during-dummy-variable-creation-in-pandas#:~:text=1%20Answer,correlations%20created%20among%20dummy%20variables.).
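Both syntaxes, and the effect of ``drop_first=True``, can be sketched as follows. The ``staff`` table, including its ``department`` and ``skills`` columns, is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
staff = pd.DataFrame({
    "department": ["HR", "IT", "Sales", "IT"],
    "skills": ["python,sql", "java", "sql,excel", "python"],
})

# Syntax 1: one dummy column per category, with a prefix
all_dummies = pd.get_dummies(staff["department"], prefix="dept")

# drop_first=True keeps k-1 columns; "HR" is implied when the others are all 0
reduced = pd.get_dummies(staff["department"], prefix="dept", drop_first=True)

# Syntax 2: split each string on the separator, then build the dummies
skill_dummies = staff["skills"].str.get_dummies(sep=",")

print(list(all_dummies.columns))
print(list(reduced.columns))
print(skill_dummies)
```

``drop_first`` is a parameter of ``pd.get_dummies()`` only; ``Series.str.get_dummies()`` does not accept it.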

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Working with Time Data</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

For anyone who works with time series data on an almost daily basis, the pandas Python package is extremely useful for time series manipulation and analysis. This basic introduction to time series data manipulation with pandas should allow you to get started in your time series analysis. Specific objectives are to show you how to:
- create a date range
- work with timestamp data
- convert string data to a timestamp
- index and slice your time series data in a data frame
- resample your time series for different time period aggregates/summary statistics
- compute a rolling statistic such as a rolling average
- work with missing data
- understand the basics of unix/epoch time
- understand common pitfalls of time series data analysis [SOURCE](https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea)

In this section, we will introduce how to work with date/time data in Pandas. This short section is by no means a complete guide to the time series tools available in Python or Pandas, but instead is intended as a broad overview of how you as a user should approach working with time series [SOURCE](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html).
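Several of the objectives listed above (creating a date range, slicing by date, resampling and rolling statistics) can be previewed in one short sketch. The series below is synthetic; its values are simply 0 to 9.

```python
import pandas as pd

# Ten consecutive days starting on 2021-01-01
idx = pd.date_range("2021-01-01", periods=10, freq="D")
ts = pd.Series(range(10), index=idx)

# Slice by date labels; both end points are included
first_week = ts.loc["2021-01-01":"2021-01-07"]

# Aggregate to weekly means (weeks end on Sunday by default)
weekly_mean = ts.resample("W").mean()

# 3-day rolling average; the first two entries have no full window
rolling_mean = ts.rolling(window=3).mean()

print(weekly_mean)
print(rolling_mean.head())
```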

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">pd.to_datetime()</p>

<a id="2.1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

For more detailed information about the to_datetime() method, please [Visit Official Document](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)

**``pd.to_datetime()``** Converts argument to datetime.

This function converts a **``scalar``**, **``array-like``**, **``Series``** or **``DataFrame/dict-like``** to a pandas datetime object.

**As stated above, many input types are supported, and lead to different output types:**

- **``scalars``** can be int, float, str, datetime object (from stdlib datetime module or numpy). They are converted to Timestamp when possible, otherwise they are converted to datetime.datetime. None/NaN/null scalars are converted to NaT.

- **``array-like``** can contain int, float, str, datetime objects. They are converted to DatetimeIndex when possible, otherwise they are converted to Index with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.

- **``Series``** are converted to Series with datetime64 dtype when possible, otherwise they are converted to Series with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.

- **``DataFrame/dict-like``** are converted to Series with datetime64 dtype. For each row a datetime is created from assembling the various dataframe columns. Column keys can be common abbreviations like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’] or plurals of the same.
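The four input types described above can be tried out in a minimal sketch. The date strings are sample values, and the variable names are chosen only for this illustration.

```python
import pandas as pd

# scalar -> Timestamp
scalar = pd.to_datetime("2021-01-23")

# array-like -> DatetimeIndex; None becomes NaT
index = pd.to_datetime(["2021-01-23", "2020-04-02", None])

# Series -> Series with a datetime64 dtype
series = pd.to_datetime(pd.Series(["2021-01-23", "2020-04-02"]))

# DataFrame -> one datetime per row, assembled from the columns
parts = pd.DataFrame({"year": [2021, 2020], "month": [1, 4], "day": [23, 2]})
assembled = pd.to_datetime(parts)

print(type(scalar).__name__, type(index).__name__)
print(assembled)
```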

[Special Note :](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html)

As many data sets contain datetime information in one of the columns, pandas input functions like [pandas.read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) and [pandas.read_json()](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html#pandas.read_json) can do the transformation to dates when reading the data, using the **``parse_dates``** parameter with a list of the columns to read as Timestamp.
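A minimal sketch of ``parse_dates`` follows. To stay self-contained it reads from an in-memory string instead of a file; the two rows and three columns are an illustrative subset, and ``orders`` is a name chosen for this example.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file
csv_text = """id_product,order_date,entry_date
401,2021-01-23,2018-12-04
416,2020-04-02,2018-12-04
"""

# The listed columns are parsed as datetimes while the file is read
orders = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["order_date", "entry_date"],
)

print(orders.dtypes)
```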

Why are these **``pandas.Timestamp``** objects useful? Let's illustrate the added value with some example cases. In this sense, let us assume that we want to work with the dates in the columns ``order_date`` and ``entry_date`` as datetime objects instead of plain text:

In [7]:
df = pd.read_csv("time_exercise.csv")
df

|     | id_product | order_date | product_quantity | product_price | entry_date |
|-----|------------|------------|------------------|---------------|------------|
| 0   | 401        | 2021-01-23 | 1.0              | 541.487603    | 2018-12-04 |
| 1   | 416        | 2020-04-02 | 1.0              | 131.181818    | 2018-12-04 |
| 2   | 717        | 2019-03-10 | 1.0              | 2035.488500   | 2018-12-04 |
| 3   | 778        | 2019-12-27 | 1.0              | 335.988000    | 2018-12-04 |
| 4   | 826        | 2020-02-19 | 1.0              | 342.292302    | 2018-12-04 |
| ... | ...        | ...        | ...              | ...           | ...        |
| 906 | 1536842    | 2020-11-24 | 1.0              | 1186.776860   | 2020-10-07 |
| 907 | 1536842    | 2020-11-24 | 1.0              | 1186.776860   | 2020-10-07 |
| 908 | 1536887    | 2020-11-22 | 1.0              | 0.000000      | 2020-11-13 |
| 909 | 1536952    | 2021-01-26 | 1.0              | 988.429752    | 2020-11-24 |


Initially, the values in ``order_date`` and ``entry_date`` are character strings and do **NOT** provide any datetime operations (e.g. extract the year, day of the week, …). By applying the ``to_datetime`` function, pandas interprets the strings and converts them to datetime (i.e. ``datetime64[ns]``) objects. In pandas, these datetime objects, which are similar to ``datetime.datetime`` from the standard library, are called [pandas.Timestamp](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp).

In [8]:
pd.to_datetime(df["order_date"])

0     2021-01-23
1     2020-04-02
2     2019-03-10
3     2019-12-27
4     2020-02-19
         ...    
906   2020-11-24
907   2020-11-24
908   2020-11-22
909   2021-01-26
910   2020-12-06
Name: order_date, Length: 911, dtype: datetime64[ns]

In [9]:
df["entry_date"] = pd.to_datetime(df["entry_date"])
df["order_date"] = pd.to_datetime(df["order_date"])

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id_product        911 non-null    int64         
 1   order_date        911 non-null    datetime64[ns]
 2   product_quantity  911 non-null    float64       
 3   product_price     911 non-null    float64       
 4   entry_date        911 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1)
memory usage: 35.7 KB


In [11]:
df.entry_date.min()

Timestamp('2018-12-04 00:00:00')

In [12]:
df.entry_date.max()

Timestamp('2020-11-26 00:00:00')

In [13]:
df.entry_date.max() - df.entry_date.min()

Timedelta('723 days 00:00:00')

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Series.dt</p>

<a id="2.2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Accessor object for datetimelike properties of the Series values [SOURCE](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html).

For comprehensive information about the datetimelike properties, please visit the [Official Pandas API Reference Document](https://pandas.pydata.org/pandas-docs/version/0.22/api.html#datetimelike-properties)

In [14]:
df["entry_date"].dt.year

0      2018
1      2018
2      2018
3      2018
4      2018
       ... 
906    2020
907    2020
908    2020
909    2020
910    2020
Name: entry_date, Length: 911, dtype: int32
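Beyond ``year``, the accessor exposes many other properties and methods. The sketch below shows a few of them on three sample dates; the ``dates`` Series is built here only for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-12-04", "2020-10-07", "2020-11-24"]))

years = dates.dt.year
months = dates.dt.month
quarters = dates.dt.quarter
weekdays = dates.dt.day_name()          # a method, so it needs parentheses
is_month_start = dates.dt.is_month_start

print(pd.DataFrame({
    "date": dates,
    "year": years,
    "month": months,
    "quarter": quarters,
    "weekday": weekdays,
}))
```

Note that ``dt`` itself is an accessor, not a method: it is written without parentheses, and the property or method follows the dot.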

### <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:LEFT; border-radius:10px 10px;">Datetime Module</p>

<a id="2.3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

The datetime module supplies classes for manipulating dates and times [SOURCE](https://docs.python.org/3/library/datetime.html).

### ``class datetime.datetime``

A combination of a date and a time. Attributes: year, month, day, hour, minute, second, microsecond, and tzinfo.

In [15]:
from datetime import datetime

### ``class datetime.timedelta``

A duration expressing the difference between two date, time, or datetime instances to microsecond resolution [SOURCE](https://www.geeksforgeeks.org/manipulate-date-and-time-with-the-datetime-module-in-python/).

In [90]:
from datetime import timedelta
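A minimal sketch of the two classes working together is given below. It reuses the earliest and latest ``entry_date`` values shown earlier; the variable names are chosen for this example.

```python
from datetime import datetime, timedelta

entry = datetime(2018, 12, 4)
last = datetime(2020, 11, 26)

# Subtracting two datetime objects gives a timedelta
span = last - entry
print(span.days)

# Adding a timedelta to a datetime gives a new datetime
one_week_later = entry + timedelta(weeks=1)
print(one_week_later)
```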

### ``strftime()``

**Converting** from date/time/datetime object **to string type** [SOURCE](https://strftime.org/)
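As a short sketch, the format codes can be combined freely. The example date and the chosen formats are illustrative, and month names such as ``%B`` depend on the active locale (English is assumed here).

```python
from datetime import datetime

import pandas as pd

moment = datetime(2021, 1, 23, 14, 30)

# Standard library: datetime -> str
iso_style = moment.strftime("%Y-%m-%d")
long_style = moment.strftime("%d %B %Y, %H:%M")

# pandas: the same format codes through the .dt accessor
dates = pd.Series(pd.to_datetime(["2021-01-23", "2020-04-02"]))
formatted = dates.dt.strftime("%d/%m/%Y")

print(iso_style)
print(long_style)
print(formatted.tolist())
```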

### ``strptime()``

**Converting** from string type **to datetime object**
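The reverse direction can be sketched in the same way. The input strings and formats below are illustrative; the format must describe the text exactly.

```python
from datetime import datetime

import pandas as pd

# Standard library: str -> datetime
parsed = datetime.strptime("23/01/2021 14:30", "%d/%m/%Y %H:%M")

# pandas: the same format codes through to_datetime
parsed_series = pd.to_datetime(
    pd.Series(["23/01/2021", "02/04/2020"]),
    format="%d/%m/%Y",
)

print(parsed)
print(parsed_series)
```

Stating the format explicitly avoids ambiguity between day-first and month-first dates such as ``02/04/2020``.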

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Operation with Datetime Object</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Let's detect the time between first order date and entry date for each product

**Let us do it by string methods**
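One possible reading of this exercise is sketched below; it is an assumption, not the notebook's original solution. The date columns are kept as text, ``.str.split`` extracts year, month and day, and ``pd.to_datetime`` reassembles them. The three-row ``orders`` table and the helper ``parts_to_datetime`` are hypothetical; only the column names follow ``time_exercise.csv``.

```python
import pandas as pd

# Hypothetical sample with the same column names, dates kept as strings
orders = pd.DataFrame({
    "id_product": [401, 401, 416],
    "order_date": ["2021-01-23", "2020-06-01", "2020-04-02"],
    "entry_date": ["2018-12-04", "2018-12-04", "2018-12-04"],
})


def parts_to_datetime(text_dates):
    """Split 'YYYY-MM-DD' strings with .str methods and rebuild datetimes."""
    parts = text_dates.str.split("-", expand=True).astype(int)
    parts.columns = ["year", "month", "day"]
    return pd.to_datetime(parts)


orders["order_dt"] = parts_to_datetime(orders["order_date"])
orders["entry_dt"] = parts_to_datetime(orders["entry_date"])

# First order date and entry date per product
per_product = orders.groupby("id_product").agg(
    first_order=("order_dt", "min"),
    entry=("entry_dt", "first"),
)
per_product["days_to_first_order"] = (
    per_product["first_order"] - per_product["entry"]
).dt.days

print(per_product)
```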

## Let's detect the time between last order date and today for each product

**This time, let us do it by datetime properties**
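A sketch of this second exercise with datetime properties follows, again under assumptions: the ``orders`` table is a hypothetical sample, and ``today`` is fixed to a reference date so that the result is reproducible.

```python
import pandas as pd

# Hypothetical sample with order dates already converted to datetime
orders = pd.DataFrame({
    "id_product": [401, 401, 416],
    "order_date": pd.to_datetime(["2021-01-23", "2020-06-01", "2020-04-02"]),
})

# Fixed reference date for reproducibility (an assumption for this example);
# pd.Timestamp.today().normalize() would give the actual current date
today = pd.Timestamp("2021-03-01")

# Last order date per product, then the elapsed time in whole days
last_order = orders.groupby("id_product")["order_date"].max()
days_since_last_order = (today - last_order).dt.days

print(days_since_last_order)
```

Subtracting a datetime Series from a Timestamp yields a timedelta Series, whose ``.dt.days`` property returns the whole number of days.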

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:150%; text-align:center; border-radius:10px 10px;">The End of Session</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>
