[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Naereen/badges)
# SoSciPy
SoSciPy is a Python library that simplifies working with data, especially in the social sciences. Although several packages exist, I have personally found it difficult to settle on the right library and the right recipes. Unlike other domains, where computational methods have grown rapidly, the social sciences remain relatively unexplored. This is the first of four tutorials that explore data analysis in education.

There are four parts to soscipy:
- **Data Analysis**: Makes rapid analysis easy without compromising on functionality or extensibility
- **Data Processing**: Makes common operations on structured data easy and accessible without needing expertise in computer science
- **Data Visualisation**: Produces visualisations quickly while keeping the output at publication quality
- **Utilities**: A set of plug-and-play utilities that make your workflow easier

### Data Analysis

The structured datasets most commonly dealt with in social science include:
- Time series
- Microdata
- Geospatial

In this notebook we will see an example of each of these data types, learn how to do basic EDA using soscipy, and make quick visualisations.

### 1. Problem statement
We want to analyse the relationship between countries' expenditure on education and their income inequality. We will import data from the World Bank using the soscipy dataloader, enrich it with some common economic indicators, and then run a regression analysis and plot the result to see how the two indicators relate.
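
To make the regression step concrete, here is a minimal sketch using plain `numpy` and `pandas` rather than soscipy. The column names (`edu_spend_pct_gdp`, `gini_index`) and every value in the frame are placeholders invented for illustration. They are not real World Bank figures and say nothing about the actual relationship.

```python
import numpy as np
import pandas as pd

# Placeholder values for illustration only; these are NOT real World Bank figures.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "edu_spend_pct_gdp": [2.0, 3.0, 4.0, 5.0, 6.0],
    "gini_index": [46.0, 44.0, 42.0, 40.0, 38.0],
})

# Ordinary least squares fit of: gini_index = slope * edu_spend_pct_gdp + intercept
slope, intercept = np.polyfit(toy["edu_spend_pct_gdp"], toy["gini_index"], deg=1)

# Pearson correlation between the two indicators
corr = toy["edu_spend_pct_gdp"].corr(toy["gini_index"])
```

With real data, the sign and size of `slope` would be the quantity of interest, and a scatter plot with the fitted line is the usual way to inspect it.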

**Fetching data**
- Visit the World Bank data page for the indicator and download the data file from this URL: https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?view=chart
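
World Bank indicator downloads are typically laid out wide, with one column per year. As a rough sketch, assuming that layout, the frame below mimics it with two hypothetical countries and made-up values, then reshapes it into one row per country and year using `pandas.melt`. The soscipy dataloader is not used here, and the downloaded file itself may need extra handling (for example skipping metadata rows at the top).

```python
import pandas as pd

# Hypothetical stand-in for a downloaded World Bank file; values are made up.
wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["AAA", "BBB"],
    "Indicator Name": ["Government expenditure on education, total (% of GDP)"] * 2,
    "Indicator Code": ["SE.XPD.TOTL.GD.ZS"] * 2,
    "2018": [4.1, None],
    "2019": [4.3, 3.2],
})

# Columns that identify a series; every other column is a year.
id_cols = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]

# Reshape from one column per year to one row per (country, year).
long = wide.melt(id_vars=id_cols, var_name="year", value_name="value")
long["year"] = long["year"].astype(int)

# Drop country-years with no reported value.
long = long.dropna(subset=["value"]).reset_index(drop=True)
```

The long layout is easier to merge with other indicators because the join keys become ordinary columns (`Country Code`, `year`).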

In [48]:
!pip install --upgrade soscipy

Collecting soscipy
  Downloading soscipy-0.0.18-py3-none-any.whl (11 kB)
Installing collected packages: soscipy
  Attempting uninstall: soscipy
    Found existing installation: soscipy 0.0.17
    Uninstalling soscipy-0.0.17:
      Successfully uninstalled soscipy-0.0.17
Successfully installed soscipy-0.0.18


In [49]:
import pandas as pd
from soscipy.process import dfops

In [50]:
f1 = '/Users/saurabhkarn/PycharmProjects/kornect/test_data/rangin_justicehub-file.xlsx'
f2 = '/Users/saurabhkarn/PycharmProjects/kornect/test_data/gyan_data.csv'

In [51]:
df1 = pd.read_excel(f1)
df2 = pd.read_csv(f2)

In [54]:
def combine(df1, df2, outer=True):
    """
    Combine two dataframes after identifying their primary keys.
    :param df1: first dataframe
    :param df2: second dataframe
    :param outer: bool, if True return the outer join of the two dataframes
    :return: the joined dataframe
    """
    # Positions of the key column in each dataframe
    left_on, right_on = get_primary_keys(df1, df2)
    list1 = list(df1[df1.columns[left_on]])
    list2 = list(df2[df2.columns[right_on]])
    # Match key values across the two dataframes by string similarity
    primary_key_joins = string_matcher(list1, list2)
    matched_list = primary_key_joins.get_matched_list()
    # Keep only the pairs scoring below 0.99 (attribute name spelled as in the source)
    matched_list = matched_list[matched_list.similairity < 0.99]
    # Replace df2's key values using the matched pairs (note: this modifies df2 in place)
    df2[df2.columns[right_on]] = df2[df2.columns[right_on]].apply(lambda x: lookup(x, matched_list))
    if outer:
        temp = pd.merge(df1, df2, left_on=df1.columns[left_on], right_on=df2.columns[right_on], how='outer')
    else:
        # Inner join; drop df2's key column, which duplicates df1's
        temp = pd.merge(df1, df2, left_on=df1.columns[left_on], right_on=df2.columns[right_on])
        temp = temp.drop([df2.columns[right_on]], axis=1)
    return temp
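
The function above joins two dataframes whose key values do not match exactly. As a self-contained illustration of that general idea (not soscipy's implementation), the sketch below uses the standard library's `difflib` to align near-identical keys before a pandas merge. The helper `fuzzy_align_keys`, the `cutoff` value, and the toy names are all hypothetical.

```python
import difflib

import pandas as pd


def fuzzy_align_keys(left_keys, right_keys, cutoff=0.8):
    """Map each right-hand key to its closest left-hand key, if similar enough."""
    mapping = {}
    for key in right_keys:
        match = difflib.get_close_matches(key, left_keys, n=1, cutoff=cutoff)
        if match:
            mapping[key] = match[0]
    return mapping


# Toy data: "Alice Smyth" is a near-duplicate spelling of "Alice Smith".
left = pd.DataFrame({"name": ["Alice Smith", "Bob Jones"], "court": ["X", "Y"]})
right = pd.DataFrame({"name": ["Alice Smyth", "Robert Brown"], "area": ["Civil", "Tax"]})

mapping = fuzzy_align_keys(list(left["name"]), list(right["name"]))

# Work on a copy so the caller's dataframe is left untouched.
right_aligned = right.copy()
right_aligned["name"] = right_aligned["name"].map(lambda k: mapping.get(k, k))

# Outer join keeps rows from both sides, matched or not.
merged = left.merge(right_aligned, on="name", how="outer")
```

Here only the close spelling is remapped, so the outer join yields one matched row and one unmatched row from each side. The cutoff is a trade-off: lower values catch more spelling variants but risk joining different entities.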

In [55]:
temp = combine(df1,df2)

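NameError: name 'get_primary_keys' is not defined

The local copy fails because the helpers it calls (`get_primary_keys`, `string_matcher`, `lookup`) are not defined in this notebook. The packaged `dfops.combine` is used instead: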

In [52]:
temp = dfops.combine(df1,df2)

In [53]:
temp

Unnamed: 0,Judges,Date of Appointment,Whether died in office,Whether resigned from office,Date of Birth_x,Intended Date of Retirement,Cadre of Appointment,Parent High Court,Appointing Authority,Gender_x,...,"If yes, what type",Area of Practice 1 (in order mentioned in profile),Area of Practice 2,Area of Practice 3,Area of Practice 4,Area of Practice 5,Area of Practice 6,Area of Practice 7,Area of Practice 8,Area of Practice 9
0,Harilal Jekisundas Kania,1946-06-20,Yes,No,03-11-1890,1955-11-02,Hc-Bar,Bombay,british,Male,...,,,,,,,,,,
1,Sir Saiyid Fazl Ali,1947-06-09,No,No,19-09-1886,1951-09-18,Hc-Bar,Patna,british,Male,...,,,,,,,,,,
2,M. Patanjali Sastri,1947-12-06,No,No,04-01-1889,1954-01-03,Hc-Bar,Madras,executive,Male,...,,,,,,,,,,
3,Mehr Chand Mahajan,1948-10-04,No,No,23-12-1889,1954-12-22,Hc-Bar,Lahore,executive,Male,...,Constitutional Adviser to His Highness the Mah...,Civil,Constitution,,,,,,,
4,Bijan Kumar Mukherjea,1948-10-14,No,Yes,15-08-1891,1956-08-14,Hc-Bar,Calcutta,executive,Male,...,Senior Government Pleader,Publication Problems Law,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406,,NaT,,,,NaT,,,,,...,State Solicitor,Labour Laws,Service matters,,,,,,,
407,,NaT,,,,NaT,,,,,...,,Civil,Commercial,Arbitration,Constitutional,,,,,
408,,NaT,,,,NaT,,,,,...,Standing Counsel of the Income Tax Department ...,Constitutional,Company,Service,Educational,Taxation,,,,
409,,NaT,,,,NaT,,,,,...,Central Government Standing Counsel,Civil,Criminal,Constitutional,Revenue,Service,,,,
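
In the output above, the last rows have an empty `Judges` value and `NaT` dates, which is the pattern an outer join produces for records that appear in only one of the two inputs. One way to check for this with plain pandas is the `indicator=True` option of `merge`, sketched here on toy frames (the column names are placeholders, not the columns of the files above).

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [10, 20]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = left.merge(right, on="key", how="outer", indicator=True)

# How many rows matched, and how many came from one side only.
counts = merged["_merge"].value_counts()

# Rows that exist only in the right-hand frame.
unmatched_right = merged[merged["_merge"] == "right_only"]
```

Inspecting the unmatched rows is a quick way to spot keys that the matching step failed to align.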
