# Multinomial logit

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

from auxiliary.occupation_data import *

***

#### Migration background
According to the SOEP, a second-generation immigrant child is an individual who was born in Germany and whose mother and father were both born abroad (indirect migration background). In this paper we also include the 1.5 generation, as in Schüller (2012): individuals who have foreign-born parents and are themselves foreign-born, but who arrived in Germany before the age of 7 (direct migration background). They went through the same education system as native Germans, and their German language skills are comparable to those of native Germans.
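The definition above can be written as a small classification rule. This is only a sketch: the function and its argument names are hypothetical and are not SOEP variable names, and anyone who does not match the two definitions in the text is simply labelled `"other"`.

```python
def migration_generation(born_in_germany, both_parents_foreign_born, age_at_arrival=None):
    """Classify one person following the definitions in the text.

    All argument names are illustrative assumptions, not SOEP variables.
    """
    if not both_parents_foreign_born:
        return "other"
    if born_in_germany:
        # born in Germany to two foreign-born parents: indirect migration background
        return "second"
    if age_at_arrival is not None and age_at_arrival < 7:
        # foreign-born, arrived before the age of 7: direct migration background
        return "1.5"
    return "other"


print(migration_generation(True, True))        # second generation
print(migration_generation(False, True, 5))    # 1.5 generation
```

In the notebook itself this information is already encoded in the `gen2_migration` column of the prepared data set.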

#### Occupation classification
To estimate the role of parental ethnic identity in immigrant children's occupational choice, we classify jobs into professionals, blue-collar and white-collar workers, based on the four-digit ISCO-88 occupation code.

Two dimensions of the skill concept are used in the definition of ISCO-88 groups: skill level, which is a function of the range and complexity of the tasks involved, where the complexity of tasks has priority over the range; and skill specialisation, which reflects the type of knowledge applied, tools and equipment used, materials worked on or with, and the nature of the goods and services produced. It should be emphasised that the focus in ISCO-88 is on the skills required to carry out the tasks and duties of an occupation, not on whether a worker in a particular occupation is more or less skilled than another worker in the same or other occupations.

The German school system places pupils into different secondary schooling tracks at around the age of 10, and the choice between Hauptschule, Realschule and Gymnasium largely determines their career track at a young age. The lower-level Hauptschule is designed to prepare pupils for manual professions; the intermediate Realschule prepares students for administrative and lower white-collar jobs; and the upper-level Gymnasium (8 years), the most prestigious track, prepares students for higher education.

In this analysis we remove military as an occupational choice. The sample consists of people of working age, 17 < age < 65.
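To make the ISCO-88 coding concrete, the first digit of a four-digit code identifies one of the ten ISCO-88 major groups. The sketch below only extracts that digit; how the notebook aggregates codes into the categories of `isco88_choices` is defined in `auxiliary.occupation_data` and is not reproduced here, so the helper name `isco88_major_group` is an assumption made for illustration.

```python
# ISCO-88 major groups, keyed by the first digit of the four-digit code
ISCO88_MAJOR_GROUPS = {
    0: "Armed forces",
    1: "Legislators, senior officials and managers",
    2: "Professionals",
    3: "Technicians and associate professionals",
    4: "Clerks",
    5: "Service workers and shop and market sales workers",
    6: "Skilled agricultural and fishery workers",
    7: "Craft and related trades workers",
    8: "Plant and machine operators and assemblers",
    9: "Elementary occupations",
}


def isco88_major_group(code):
    """Return the major-group digit of a four-digit ISCO-88 code.

    Hypothetical helper: codes outside 0..9999 (for example negative
    missing-value codes) return None. NaN inputs are not handled.
    """
    value = int(code)
    if not 0 <= value <= 9999:
        return None
    return value // 1000


print(ISCO88_MAJOR_GROUPS[isco88_major_group(2310)])
```

Codes in the armed forces group start with 0, so a code stored as the string `"0110"` maps to major group 0.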
 
#### Country of origin
We distinguish the following groups of ethnic origin: native German (i.e., German-born grandparents) and the second-generation groups German, Turkish, Eastern Europe, Southern Europe, Anglo, Balkans, France-Benelux, Middle East, Latin America-Caribbean, Africa and Asia. We treat the father as the head of the household, with the most influence on household income and on the children's career decisions. We therefore assign second-generation immigrants to an ethnic group based on the father's country of origin.


In [2]:
final_df = pd.read_pickle("Final_data81.pkl")

# 2gen immigrant
immig_2gen = final_df[final_df['gen2_migration']==1]

# native
nat_germ = final_df[final_df['gen2_migration']==0]

## Regression

We now keep one observation per person, restrict the sample to ages 30 to 50, and remove military as an occupational choice: this analysis focuses on career choice in terms of the skill level of the occupation, and the military category carries no skill-level implication.

$$ y_{c} = \beta_0 +\beta_1 gender + \beta_2 country + \beta_3 yearsEducation + \beta_4 occupationalChange + \beta_5 fulltimeExperience +\epsilon $$
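Read as a multinomial logit, the right-hand side above acts as the linear index for occupation category $c$, with a separate coefficient vector for each category. One standard way to write the implied choice probabilities, with the coefficients of a reference category normalised to zero, is sketched here; the symbols $x_i$, $\beta_c$ and $C$ are notation introduced for this sketch and do not appear in the original.

$$ P(y_i = c \mid x_i) = \frac{\exp(x_i'\beta_c)}{\sum_{k=1}^{C}\exp(x_i'\beta_k)}, \qquad \beta_1 = 0 $$

Under this normalisation each reported coefficient is the effect of a regressor on the log-odds of category $c$ relative to the reference category.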

In [3]:
# keep respondents aged 30 to 50 (between() is inclusive on both ends)
final_df = final_df[final_df['bioage'].between(30,50)]
# drop military as an occupational choice
final_df = final_df.drop(final_df[final_df['isco88_choices']=='military'].index)

# keep one row per person (pid): the most recent survey year
final_df = final_df.sort_values(by=['pid','syear'])
final_df.drop_duplicates("pid", keep='last',inplace = True)
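The sort-then-deduplicate step keeps each person's most recent survey year. A minimal sketch on invented toy data (the values are made up; only the column names `pid`, `syear` and `bioage` come from the notebook):

```python
import pandas as pd

# toy panel: two people observed in several survey years (invented values)
panel = pd.DataFrame({
    "pid":    [1, 1, 2, 2, 2],
    "syear":  [2010, 2012, 2009, 2015, 2011],
    "bioage": [40, 42, 31, 37, 33],
})

# sort within person by survey year, then keep the last row of each person
latest = (
    panel.sort_values(by=["pid", "syear"])
         .drop_duplicates("pid", keep="last")
)
print(latest)
```

Because the rows are sorted by `syear` within `pid` first, `keep="last"` selects the latest observation rather than an arbitrary one.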

In [4]:
co_list= ['Asia','Middle_East','Eastern_Europe','France_Benelux','Anglo','Latin_America_Caribbean','Africa','Scandinavia']
# drop origin groups with small samples and rows with negative (missing-coded) full-time experience
final_df = final_df.drop(final_df[(final_df['forigin_group'].isin(co_list))|(final_df['full_time_experience']<0)].index)

#### Basic model

Because the sample size is very small for some origin groups, we consider only the five largest groups: German_native, German_group, South_Europe, Turkish_group and Balkans.
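The origin groups enter the model as dummy variables with one group left out as the reference. A small sketch on invented rows shows which category `pd.get_dummies(..., drop_first=True)` omits: categories are ordered alphabetically, so with the labels spelled as above, Balkans would be the baseline. This matches the second model, which drops `forigin_group_Balkans` explicitly.

```python
import pandas as pd

# invented rows, using three of the group labels from the text
toy = pd.DataFrame(
    {"forigin_group": ["Balkans", "German_native", "Turkish_group", "German_native"]}
)

# drop_first=True omits the alphabetically first category (the reference group)
dummies = pd.get_dummies(toy, columns=["forigin_group"], drop_first=True, dtype=float)
print(dummies.columns.tolist())
```

A row belonging to the reference group has zeros in every remaining dummy column, so its effect is absorbed by the constant.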

In [5]:
fifth_model = final_df[['isco88_choices','forigin_group','male',"years_education",'occupational_change',"full_time_experience"]]
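# dtype=float keeps the dummies numeric on pandas versions where get_dummies returns booleans
fifth_model = pd.get_dummies(fifth_model, columns=['forigin_group'],drop_first=True, dtype=float)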

X_5 = fifth_model.drop(['isco88_choices'], axis=1) 
y_5 = fifth_model['isco88_choices']
#X_5.columns

In [6]:
logit_model5 = sm.MNLogit(y_5,sm.add_constant(X_5))
result5 = logit_model5.fit(cov_type="hc3")
stats5 = result5.summary()
print(stats5)

Optimization terminated successfully.
         Current function value: 1.238326
         Iterations 6
                          MNLogit Regression Results                          
Dep. Variable:         isco88_choices   No. Observations:                26834
Model:                        MNLogit   Df Residuals:                    26798
Method:                           MLE   Df Model:                           32
Date:                Fri, 08 Jan 2021   Pseudo R-squ.:                  0.1923
Time:                        23:18:12   Log-Likelihood:                -33229.
converged:                       True   LL-Null:                       -41139.
Covariance Type:                  hc3   LLR p-value:                     0.000
         isco88_choices=elementary       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
const                                  0.4935      0.311    

***

#### Second model

Observing that the proportion of professionals and technicians increases over the years while the share of blue-collar jobs decreases, we hypothesise that career choice may have shifted with the labour-market environment between 1984 and 2018 (more job opportunities in the service industry, i.e. white-collar or professional jobs, than in manufacturing, i.e. blue-collar jobs). A limitation of this dataset is that we do not have enough observations for every ethnic group across many periods, so in this analysis we use only two cohorts: millennials, born between 1981 and 1996, and non-millennials. The millennial generation is generally marked by elevated usage of and familiarity with the Internet, mobile devices and social media, which also affect educational achievement and career choices.
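A cohort effect of this kind is typically modelled by interacting the cohort indicator with the origin-group dummies. The notebook does this through `cohort_effect_interaction` from the auxiliary module, whose implementation is not shown; the function below is a hypothetical stand-in, and the `birth_year` column and the interaction column names are assumptions made for illustration.

```python
import pandas as pd


def add_cohort_interactions(df, cohort_col="millennials", prefix="forigin_group_"):
    """Hypothetical stand-in: add cohort x origin-group product columns."""
    out = df.copy()
    for col in [c for c in df.columns if c.startswith(prefix)]:
        out[f"{cohort_col}_x_{col}"] = out[cohort_col] * out[col]
    return out


# invented rows; birth_year is an assumed column name
toy = pd.DataFrame({
    "birth_year": [1975, 1985, 1990],
    "forigin_group_Balkans": [1.0, 0.0, 0.0],
    "forigin_group_Turkish_group": [0.0, 1.0, 0.0],
})
# millennials: born between 1981 and 1996, both bounds included
toy["millennials"] = toy["birth_year"].between(1981, 1996).astype(float)
toy = add_cohort_interactions(toy)
print(toy.filter(like="_x_"))
```

Each interaction column equals 1 only for millennials in the corresponding origin group, so its coefficient measures how that group's millennials differ from its non-millennials.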

In [7]:
sixth_model = final_df[['isco88_choices','forigin_group',"millennials",'male',"years_education",'occupational_change',"full_time_experience"]]
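# dtype=float keeps the dummies numeric on pandas versions where get_dummies returns booleans
sixth_model = pd.get_dummies(sixth_model, columns=['forigin_group'], dtype=float)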

sixth_model = cohort_effect_interaction(sixth_model)

# drop the outcome, the millennials column and the Balkans dummy (reference group)
X_6 = sixth_model.drop(['isco88_choices',"millennials",'forigin_group_Balkans'], axis=1)
y_6 = sixth_model['isco88_choices']
#X_6.columns

In [8]:
logit_model6 = sm.MNLogit(y_6,sm.add_constant(X_6))
result6 = logit_model6.fit(cov_type="hc3")
stats6 = result6.summary()
print(stats6)

Optimization terminated successfully.
         Current function value: 1.236279
         Iterations 7
                          MNLogit Regression Results                          
Dep. Variable:         isco88_choices   No. Observations:                26834
Model:                        MNLogit   Df Residuals:                    26778
Method:                           MLE   Df Model:                           52
Date:                Fri, 08 Jan 2021   Pseudo R-squ.:                  0.1936
Time:                        23:18:46   Log-Likelihood:                -33174.
converged:                       True   LL-Null:                       -41139.
Covariance Type:                  hc3   LLR p-value:                     0.000
             isco88_choices=elementary       coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------
const                                      0.6284   